# Exploratory Data Analysis of Vitamin D-Related Transcriptional Signatures

This notebook presents the exploratory data analysis (EDA) of transcriptional responses to vitamin D and its analogs, based on data from the LINCS L1000 dataset. The dataset includes gene expression signatures from human cell lines exposed to different vitamin D-related compounds across various doses and time points.

The goal of this analysis is to characterize the diversity and distribution of the selected signatures, examine the experimental conditions (cell lines, doses, exposure times), and identify preliminary patterns in gene expression profiles that may inform subsequent modeling and biological interpretation.

In [None]:
# Import libraries for data manipulation, visualization, and dimensionality reduction
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

In [None]:
# Visualization settings
sns.set_theme(style="whitegrid")  # set_theme is the current name; sns.set is kept as an alias
plt.rcParams["figure.figsize"] = (8, 6)

In [None]:
# Paths to pre-filtered files
EXPR_PATH = "../processed_data/vitD_expression_matrix.csv"
META_PATH = "../processed_data/siginfo_vitD_filtered.csv"

# Load expression matrix (genes x signatures)
exp_df = pd.read_csv(EXPR_PATH, index_col=0)

# Load metadata associated with the signatures
meta_df = pd.read_csv(META_PATH, index_col="sig_id")

# Display shapes
print("Expression matrix shape:", exp_df.shape)
print("Metadata shape:", meta_df.shape)


In [None]:
# Confirm all signature IDs in metadata are present in the expression matrix
assert set(meta_df.index).issubset(set(exp_df.columns)), "Some sig_id in metadata are missing from expression matrix"

# Reorder expression matrix columns to match metadata index
exp_df = exp_df[meta_df.index]

# Final confirmation
print("Column order aligned:", all(exp_df.columns == meta_df.index))
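
The alignment step above stops at a bare `assert` when a signature is missing, without saying which one. A minimal sketch of a more informative check follows. It runs on toy data rather than the real files, and the helper name `align_to_metadata` and the toy values are hypothetical, not part of the original notebook.

```python
import pandas as pd

# Toy stand-ins for exp_df (genes x signatures) and meta_df (indexed by sig_id).
# All identifiers and values here are made up for illustration.
toy_exp = pd.DataFrame(
    [[0.1, -0.2, 0.3], [1.0, 0.5, -0.7]],
    index=["GENE_A", "GENE_B"],
    columns=["sig_3", "sig_1", "sig_2"],
)
toy_meta = pd.DataFrame(
    {"cmap_name": ["calcitriol", "calcitriol"]},
    index=pd.Index(["sig_1", "sig_2"], name="sig_id"),
)


def align_to_metadata(expr, meta):
    """Return expr restricted and reordered to the signatures listed in meta."""
    # Report which signature IDs are absent instead of failing silently
    missing = meta.index.difference(expr.columns)
    if len(missing) > 0:
        raise ValueError(
            f"{len(missing)} sig_id(s) missing from expression matrix, "
            f"e.g. {list(missing[:5])}"
        )
    # Duplicated IDs would silently duplicate columns during reordering
    if meta.index.has_duplicates:
        raise ValueError("Metadata index contains duplicated sig_id values")
    return expr.loc[:, meta.index]


aligned = align_to_metadata(toy_exp, toy_meta)
print("Column order aligned:", aligned.columns.equals(toy_meta.index))
```

Raising a `ValueError` with the offending IDs keeps the check active even when Python runs with assertions disabled, and the duplicate check covers a case the subset test alone would not catch.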

In [None]:
# Display a subset of the expression matrix to inspect gene-signature structure
display(exp_df.iloc[:5, :5])

# Display the first rows of the metadata to examine available annotations
display(meta_df.head())

# Calculate and report the number of unique compounds present in the filtered dataset
n_compounds = meta_df["cmap_name"].nunique()
print("Number of unique compounds:", n_compounds)

# Summarize the number of transcriptional signatures available per compound
compound_counts = meta_df["cmap_name"].value_counts()
display(compound_counts)
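
The per-compound counts above can be extended to the other experimental conditions named in the introduction (cell lines, doses, exposure times). The sketch below uses toy metadata. The column names `cell_iname`, `pert_dose`, and `pert_time` are assumptions about the metadata layout and should be checked against `meta_df.columns` before use; only `cmap_name` is confirmed by the cells above.

```python
import pandas as pd

# Toy metadata; values are invented, and every column except cmap_name
# is an assumed name that may differ in the real siginfo file.
toy_meta = pd.DataFrame(
    {
        "cmap_name": ["calcitriol", "calcitriol", "calcitriol", "alfacalcidol"],
        "cell_iname": ["MCF7", "MCF7", "PC3", "MCF7"],
        "pert_dose": [10.0, 1.0, 10.0, 10.0],
        "pert_time": [24, 24, 6, 24],
    },
    index=pd.Index(["sig_1", "sig_2", "sig_3", "sig_4"], name="sig_id"),
)

# Number of signatures for each compound / cell line combination
per_cell = pd.crosstab(toy_meta["cmap_name"], toy_meta["cell_iname"])

# Per-compound summary of the experimental design
design = toy_meta.groupby("cmap_name").agg(
    n_signatures=("cell_iname", "size"),
    n_cell_lines=("cell_iname", "nunique"),
    n_doses=("pert_dose", "nunique"),
    n_times=("pert_time", "nunique"),
)

print(per_cell)
print(design)
```

A cross-tabulation makes empty compound / cell line combinations visible as zeros, which helps judge how balanced the design is before any modeling.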