## Filtering Vitamin D-related compounds from LINCS metadata
This notebook filters vitamin D-related compounds, experimental instances, and transcriptomic signatures using the LINCS L1000 metadata files `compoundinfo_beta.txt`, `instinfo_beta.txt`, and `siginfo_beta.txt`, then extracts the matching profiles from the Level 5 expression matrix `level5_beta_trt_cp_n720216x12328.gctx`.

In [None]:
# Import libraries for data loading, filtering, and basic preprocessing
import pandas as pd
from cmapPy.pandasGEXpress import parse_gctx

### Loading Metadata

This section loads three `.txt` files containing key metadata from the LINCS L1000 dataset:

- `compoundinfo_beta.txt`: includes compound names, IDs, and other chemical metadata.
- `instinfo_beta.txt`: provides information about experimental instances, including dose, exposure time, and conditions.
- `siginfo_beta.txt`: contains transcriptomic signature metadata, such as cell line and instance identifiers.

These metadata tables are used to filter signatures related to vitamin D compounds.
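
Before filtering, it can help to confirm that each loaded table exposes the columns used later in this notebook. The sketch below runs on toy data; the `require_columns` helper is hypothetical, and the column list is simply the set of `instinfo` columns referenced in the cells that follow.

```python
import pandas as pd


def require_columns(df, required, name):
    """Raise a KeyError listing any required columns that are missing from df."""
    missing = [col for col in required if col not in df.columns]
    if missing:
        raise KeyError(f"{name} is missing columns: {missing}")


# Toy stand-in for instinfo_beta.txt (assumption: the real file has many more columns)
toy_inst = pd.DataFrame({
    "pert_id": ["BRD-A"],
    "cmap_name": ["calcitriol"],
    "cell_iname": ["MCF7"],
    "pert_time": [24.0],
    "pert_dose": [10.0],
})

# Passes silently when every column is present
require_columns(
    toy_inst,
    ["pert_id", "cmap_name", "cell_iname", "pert_time", "pert_dose"],
    "inst_df",
)
```

The same call could be repeated for `comp_df` and `siginfo_df` with the columns each later cell relies on.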

In [None]:
# Load metadata files from raw_data (gzipped and renamed)
comp_df = pd.read_csv("../raw_data/compoundinfo_beta.txt", sep="\t", low_memory=False)
inst_df = pd.read_csv("../raw_data/instinfo_beta.txt", sep="\t", low_memory=False)
siginfo_df = pd.read_csv("../raw_data/siginfo_beta.txt", sep="\t", low_memory=False)


### Identifying Vitamin D-related Compounds and Experimental Conditions

We perform a broad keyword search across all compound metadata fields to identify entries related to vitamin D or its analogs. 

After extracting the corresponding `pert_id` values, we filter the experimental conditions (`instinfo_beta.txt`) and summarize:

- The most frequent compounds
- The most represented cell lines
- The most common exposure times

This sets the foundation for selecting a biologically meaningful and well-supported subset for transcriptomic analysis.
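
The keyword search can be illustrated on a toy table. This is a minimal sketch, not the notebook's exact code: it uses a column-wise match, which avoids a Python-level loop over rows, and the toy column names and values are assumptions made for illustration. Note that the pattern is a substring match, so a short token such as `vdr` would also match any longer string that contains it.

```python
import pandas as pd

# Toy stand-in for compoundinfo_beta.txt; values are illustrative only
toy_comp = pd.DataFrame({
    "pert_id": ["BRD-1", "BRD-2", "BRD-3"],
    "cmap_name": ["calcitriol", "aspirin", "compound-x"],
    "target": [None, "PTGS1", "VDR"],
})

pattern = r"vitamin d|vdr|calcitriol|calciferol"

# Treat every column as text; a row matches if any of its fields matches
toy_mask = toy_comp.astype(str).apply(
    lambda col: col.str.contains(pattern, case=False, regex=True)
).any(axis=1)

# Unique identifiers of the matched compounds
toy_pert_ids = toy_comp.loc[toy_mask, "pert_id"].unique().tolist()
print(toy_pert_ids)  # ['BRD-1', 'BRD-3']
```

Here `BRD-1` matches through its name and `BRD-3` through its target annotation, which is why searching all fields captures analogs whose names do not contain any of the keywords.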

In [None]:
# Step 1: Perform a wide keyword match across all compound columns to capture vitamin D-related entries
mask = comp_df.apply(
    lambda row: row.astype(str).str.contains("vitamin d|vdr|calcitriol|calciferol", 
                                             case=False, na=False).any(), 
    axis=1
)

# Step 2: Extract unique pert_id values for all matched vitamin D-related compounds
vdr_pert_ids = comp_df[mask]["pert_id"].unique()

# Step 3: Filter experimental conditions that involve any of the selected vitamin D compounds
subset_vdr = inst_df[inst_df["pert_id"].isin(vdr_pert_ids)]

# Step 4: Summarize experimental conditions by compound, cell line, and exposure time
print(f"Total experimental conditions: {len(subset_vdr)}")

print("\nMost common compounds:")
print(subset_vdr["cmap_name"].value_counts().head(10))

print("\nMost common cell lines:")
print(subset_vdr["cell_iname"].value_counts().head(5))

print("\nMost common exposure times:")
print(subset_vdr["pert_time"].value_counts().head(5))

Of the 5,919 experimental conditions involving vitamin D-related compounds:

- **Compounds**: The 7 most represented are:
  - *Calcitriol, Calcipotriol, Maxacalcitol, Seocalcitol, Ercalcitriol, Tacalcitol, Paricalcitol*
  - These analogs are clinically or experimentally relevant, covering dermatology, oncology, nephrology, and experimental VDR modulation.
  
- **Exposure time**: 
  - *24 hours* is by far the most frequent condition (4,352 of 5,919, about 74%), which keeps transcriptomic comparisons consistent.

- **Cell lines**: 
  - The 5 most represented (*MCF7, A549, PC3, HA1E, U2OS*) are of breast, lung, prostate, kidney, and bone origin, offering biological diversity while maintaining data density.

> Based on this, we define our core subset using:  
> **7 compounds × 5 cell lines × 24h exposure**  
> This balances representativeness, biological relevance, and statistical robustness.


In [None]:
# Cross-tab of compound × cell line at 24h to check coverage before subsetting
subset_24h = subset_vdr[subset_vdr["pert_time"] == 24.0]
pd.crosstab(subset_24h["cmap_name"], subset_24h["cell_iname"]).loc[
    ["calcitriol", "calcipotriol", "maxacalcitol", "seocalcitol",
     "ercalcitriol", "tacalcitol", "paricalcitol"],
    ["MCF7", "A549", "PC3", "HA1E", "U2OS"]
]

### Subset design decisions

To ensure a representative and balanced experimental subset, we checked coverage across the 7 selected vitamin D-related compounds and the 5 most frequent cell lines, restricted to 24h exposure.

All compound–cell line combinations are present except:
- **Maxacalcitol–U2OS**: 0 conditions
- **Paricalcitol–U2OS**: 0 conditions

We chose to **keep U2OS** in the final subset, as:
- It is well represented across other compounds (e.g., Calcitriol, Calcipotriol)
- It provides biological diversity (bone origin, relevant to vitamin D)
- Slight gaps in factorial coverage are acceptable in real-world high-throughput screening (HTS) data

> Final design: **7 compounds × 5 cell lines × 24h exposure** (minus 2 missing combinations)
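
Missing combinations like those listed above can also be found programmatically. The sketch below uses toy data and a reduced 2 × 2 grid; it relies on `reindex(..., fill_value=0)` so that a compound or cell line with no 24h conditions at all still appears in the table instead of raising a `KeyError`. The variable names are illustrative.

```python
import pandas as pd

# Toy stand-in for the 24h subset; names and counts are illustrative only
toy_24h = pd.DataFrame({
    "cmap_name": ["calcitriol", "calcitriol", "paricalcitol"],
    "cell_iname": ["MCF7", "U2OS", "MCF7"],
})
compounds = ["calcitriol", "paricalcitol"]
cells = ["MCF7", "U2OS"]

# reindex guarantees the full grid, filling absent rows or columns with 0
coverage = (
    pd.crosstab(toy_24h["cmap_name"], toy_24h["cell_iname"])
    .reindex(index=compounds, columns=cells, fill_value=0)
)

# Collect (compound, cell line) pairs that have no conditions
missing_pairs = [
    (compound, cell)
    for compound in compounds
    for cell in cells
    if coverage.loc[compound, cell] == 0
]
print(missing_pairs)  # [('paricalcitol', 'U2OS')]
```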

In [None]:
# Define selected compounds, cell lines, and exposure time
selected_compounds = [
    "calcitriol", "calcipotriol", "maxacalcitol", "seocalcitol",
    "ercalcitriol", "tacalcitol", "paricalcitol"
]
selected_cells = ["MCF7", "A549", "PC3", "HA1E", "U2OS"]

# Filter dataset
subset_final = subset_vdr[
    (subset_vdr["cmap_name"].isin(selected_compounds)) &
    (subset_vdr["cell_iname"].isin(selected_cells)) &
    (subset_vdr["pert_time"] == 24.0)
].copy()

# Save to CSV
subset_final.to_csv("../processed_data/instinfo_vitD_subset.csv", index=False)

# Summary
print(f"Final subset saved: {len(subset_final)} conditions")


In [None]:
# Most common doses per compound
dose_summary = subset_final.groupby("cmap_name")["pert_dose"].value_counts().unstack().fillna(0).astype(int)

# Display dose counts by compound
dose_summary

### Notes on dose selection

Doses in the LINCS dataset are highly variable, including unusual values (e.g., 0.769231 or 2.307690 µM), which likely result from compound-specific adjustments.

For now, we do **not treat dose as a primary variable**, but we retain all conditions to preserve sample size.

> In future analyses, we may group doses into:
> - **Low dose**: < 1 µM  
> - **High dose**: ≥ 1 µM  
>
> to reduce noise and account for potential nonlinear responses.
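
The proposed grouping could be sketched as follows. This assumes every dose is expressed in µM, which should be verified against the dataset's dose unit information before applying it; the `dose_group` column name is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy doses (assumption: all values are in µM)
toy_doses = pd.DataFrame({"pert_dose": [0.04, 0.769231, 1.0, 2.30769, 10.0]})

# Low dose: < 1 µM; high dose: >= 1 µM
# Caveat: a missing dose (NaN) fails the "< 1" test and would be labeled "high",
# so missing values should be handled separately on real data
toy_doses["dose_group"] = np.where(toy_doses["pert_dose"] < 1.0, "low", "high")

print(toy_doses["dose_group"].tolist())  # ['low', 'low', 'high', 'high', 'high']
```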

In [None]:
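# Build a merge key (compound × cell line × exposure time) in both dataframes.
# The key is shared by all instances of a condition and does not include dose.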
subset_final["merge_key"] = subset_final["pert_id"] + "_" + subset_final["cell_iname"] + "_" + subset_final["pert_time"].astype(str)
siginfo_df["merge_key"] = siginfo_df["pert_id"] + "_" + siginfo_df["cell_iname"] + "_" + siginfo_df["pert_time"].astype(str)

# Filter signature metadata to match selected experimental conditions
siginfo_vitD = siginfo_df[siginfo_df["merge_key"].isin(subset_final["merge_key"])].copy()

# Summary
print(f"Signature entries matched: {len(siginfo_vitD)}")

In [None]:
# Filter signatures with at least 3 replicates
siginfo_vitD_clean = siginfo_vitD[siginfo_vitD["nsample"] >= 3].copy()

# Save filtered signature metadata
siginfo_vitD_clean.to_csv("../processed_data/siginfo_vitD_filtered.csv", index=False)

# Save list of retained sig_id values
siginfo_vitD_clean["sig_id"].to_csv("../processed_data/sig_ids_vitD_filtered.txt", index=False, header=False)

# Summary
print(f"Signatures retained after filtering: {len(siginfo_vitD_clean)}")

To ensure data quality, we filtered the 422 matched signatures to retain only those with **at least 3 biological replicates** (`nsample ≥ 3`).

This threshold improves the robustness and reproducibility of downstream analyses by excluding noisy or weakly supported signatures.

- **Signatures retained**: 258  
- **Signatures excluded**: 164 (due to low replicate count)

> The final list of `sig_id` values was saved and will be used to extract expression data from the Level 5 matrix.
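
The matching and replicate-filtering steps can be sketched on toy data. Instead of concatenating strings into a key, this variant joins on the key columns directly after casting `pert_time` to a common numeric type; with string keys, an exposure time stored as `24` in one table and `24.0` in the other would silently fail to match. All identifiers and values below are made up for illustration.

```python
import pandas as pd

# Toy stand-ins; identifiers and values are illustrative only
toy_conditions = pd.DataFrame({
    "pert_id": ["BRD-1", "BRD-1", "BRD-2"],
    "cell_iname": ["MCF7", "MCF7", "A549"],
    "pert_time": [24.0, 24.0, 24.0],
})
toy_siginfo = pd.DataFrame({
    "sig_id": ["S1", "S2", "S3", "S4"],
    "pert_id": ["BRD-1", "BRD-2", "BRD-2", "BRD-9"],
    "cell_iname": ["MCF7", "A549", "A549", "MCF7"],
    "pert_time": [24, 24, 24, 24],  # integer dtype on purpose
    "nsample": [3, 2, 4, 5],
})

key_cols = ["pert_id", "cell_iname", "pert_time"]

# Use the same numeric dtype on both sides before joining
toy_conditions["pert_time"] = toy_conditions["pert_time"].astype(float)
toy_siginfo["pert_time"] = toy_siginfo["pert_time"].astype(float)

# Deduplicate conditions so the join cannot multiply signature rows
keys = toy_conditions[key_cols].drop_duplicates()

# Keep only signatures whose condition appears in the selected subset
matched = toy_siginfo.merge(keys, on=key_cols, how="inner")

# Keep signatures built from at least 3 replicate samples
retained = matched[matched["nsample"] >= 3]
print(len(matched), len(retained))  # 3 2
```

As in the notebook's own key, dose is not part of the join, so all doses of a selected condition are kept.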


In [None]:
# Load the list of filtered sig_ids (one per line, no header)
sig_ids = pd.read_csv("../processed_data/sig_ids_vitD_filtered.txt", header=None)[0].tolist()

# Path to the Level 5 GCTX expression matrix
gctx_file = r"../raw_data/level5_beta_trt_cp_n720216x12328.gctx"

# Parse the GCTX file and load only the selected columns (signatures)
gctoo = parse_gctx.parse(gctx_file, cid=sig_ids)

# Extract the expression matrix (rows: genes, columns: sig_ids)
expression_df = gctoo.data_df

# Save matrix to CSV (can be large)
expression_df.to_csv("../processed_data/vitD_expression_matrix.csv")

# Print shape for confirmation
print(f"Expression matrix shape: {expression_df.shape}")

### Summary and Next Steps

- 258 filtered signatures (`nsample ≥ 3`) were retained for 7 vitamin D-related compounds across 5 cell lines at 24h.
- Expression matrix (`vitD_expression_matrix.csv`) saved for downstream analysis.
- Next step: Exploratory Data Analysis (EDA) of the expression profiles.
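
For the EDA step, the saved matrix has to be read back with its first column restored as the row index. The sketch below shows the round trip on a toy matrix using an in-memory buffer in place of the CSV file; the gene identifiers, signature names, and index name are illustrative assumptions.

```python
import io

import pandas as pd

# Toy matrix standing in for vitD_expression_matrix.csv (rows: genes, columns: signatures)
toy_matrix = pd.DataFrame(
    [[0.5, -1.2], [2.0, 0.1]],
    index=pd.Index(["5720", "466"], name="rid"),
    columns=["S1", "S3"],
)

# Round-trip through CSV text, mirroring to_csv() followed by read_csv()
buffer = io.StringIO()
toy_matrix.to_csv(buffer)
buffer.seek(0)

# index_col=0 restores the gene identifiers as the index
reloaded = pd.read_csv(buffer, index_col=0)

# Numeric-looking identifiers are parsed as integers; cast back to strings
reloaded.index = reloaded.index.astype(str)

print(reloaded.shape)  # (2, 2)
```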