# Feature Variability Analysis

In this notebook, we analyze how much the expression of each gene varies across the cell lines in the dataset. Genes whose expression differs widely between cell lines are more likely to carry signal for drug response prediction, whereas genes with nearly constant expression cannot distinguish one cell line from another and are therefore less informative.

We will calculate the **standard deviation** of gene expression across cell lines and identify the most variable genes.
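
As a minimal illustration of the metric, the sketch below computes the per-gene standard deviation on a toy matrix. The gene names and values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy expression matrix (hypothetical values): rows = cell lines, cols = genes
toy = pd.DataFrame(
    {
        "GENE_A": [1.0, 1.0, 1.0, 1.0],  # constant across cell lines
        "GENE_B": [0.0, 2.0, 4.0, 6.0],  # widely spread
        "GENE_C": [1.0, 2.0, 1.0, 2.0],  # mildly spread
    },
    index=["line_1", "line_2", "line_3", "line_4"],
)

# pandas computes the sample standard deviation (ddof=1) by default
toy_std = toy.std().sort_values(ascending=False)
print(toy_std)
```

The constant gene ends up at the bottom of the ranking with a standard deviation of zero, which is the behaviour the analysis below relies on.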


In [1]:
import os
import polars as pl
import pandas as pd


In [2]:
aligned = pd.read_parquet("../../data/bulk/rnaseq_fpkm.parquet")
transposed_df = aligned.set_index(aligned.columns[0]).transpose()

# Coerce all values to numeric; entries that cannot be parsed become NaN and are then filled with 0.0
transposed_df = transposed_df.apply(pd.to_numeric, errors='coerce').fillna(0.0)

# Reset index to turn cell line names into a column
transposed_df.index.name = "SANGER_MODEL_ID"
transposed_df.reset_index(inplace=True)

# Convert back to Polars
cell_gene_matrix = pl.from_pandas(transposed_df)

# Drop unwanted columns
cell_gene_matrix = cell_gene_matrix.drop(["model_name", "dataset_name", "data_source", "gene_id"])

print("Transposed gene expression data to shape: rows = cell lines, cols = genes")
print(f"Shape: {cell_gene_matrix.shape}")
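# Skip the first row; note that the shape printed above was computed before this step
cell_gene_matrix = cell_gene_matrix.slice(1)

# Every column except the cell line identifier is treated as a gene
non_gene_cols = ["SANGER_MODEL_ID"]
gene_columns = [col for col in cell_gene_matrix.columns if col not in non_gene_cols]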

Transposed gene expression data to shape: rows = cell lines, cols = genes
Shape: (1431, 37603)
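
The reshaping in the cell above can be hard to follow on a matrix of this size, so here is a small sketch of the same pandas steps on toy data. It assumes the source layout is one row per gene and one column per cell line; the column name `gene_symbol`, the cell line IDs, and all values are hypothetical.

```python
import pandas as pd

# Toy input (hypothetical): one row per gene, one column per cell line
genes_by_lines = pd.DataFrame(
    {
        "gene_symbol": ["GENE_A", "GENE_B"],
        "SIDM00001": [1.5, 0.2],
        "SIDM00002": [3.0, "n/a"],  # a non-numeric entry
    }
)

# The first column becomes the index, then rows and columns are swapped
lines_by_genes = genes_by_lines.set_index(genes_by_lines.columns[0]).transpose()

# Non-numeric entries are coerced to NaN and then replaced with 0.0
lines_by_genes = lines_by_genes.apply(pd.to_numeric, errors="coerce").fillna(0.0)

# Turn the cell line names from the index into a regular column
lines_by_genes.index.name = "SANGER_MODEL_ID"
lines_by_genes = lines_by_genes.reset_index()
print(lines_by_genes)
```

Note that filling missing values with 0.0 treats them as "not expressed", which can change a gene's standard deviation if many values are missing.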


In [3]:
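# Select gene columns by name rather than by position. Assuming SANGER_MODEL_ID is the
# only non-gene column left after the drop in the previous cell, columns[3:] would
# silently skip the first two genes.
gene_columns = [col for col in cell_gene_matrix.columns if col != "SANGER_MODEL_ID"]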

# Calculate standard deviation for each gene
gene_std = cell_gene_matrix[gene_columns].to_pandas().std().sort_values(ascending=False)

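# Save the full ranking, sorted from most to least variable
gene_std_df = pd.DataFrame({"Gene": gene_std.index, "Standard Deviation": gene_std.values})
os.makedirs("statistics", exist_ok=True)  # make sure the output directory exists
gene_std_df.to_csv("statistics/most_variable_genes.csv", index=False)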

print("📂 Most variable genes saved to 'statistics/most_variable_genes.csv'")


📂 Most variable genes saved to 'statistics/most_variable_genes.csv'
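
The saved file holds every gene, ranked by standard deviation, so a downstream step still has to pick a cutoff. The sketch below shows one possible way to do that on a stand-in table with the same two columns. The values and the `top_n` cutoff are hypothetical, not results from this dataset.

```python
import pandas as pd

# Stand-in for pd.read_csv("statistics/most_variable_genes.csv"); values are hypothetical
ranking = pd.DataFrame(
    {
        "Gene": ["GENE_B", "GENE_C", "GENE_A"],
        "Standard Deviation": [2.58, 0.58, 0.0],
    }
)

top_n = 2  # hypothetical cutoff; choose it to suit the downstream model

# Keep the top_n genes with the largest standard deviation
top_genes = ranking.nlargest(top_n, "Standard Deviation")["Gene"].tolist()

# Genes with zero standard deviation are constant across cell lines
constant_genes = ranking.loc[ranking["Standard Deviation"] == 0.0, "Gene"].tolist()

print(top_genes)
print(constant_genes)
```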


### Conclusion

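We calculated the standard deviation of gene expression values across cell lines and ranked all genes from most to least variable. The top-ranked genes are candidates for features in drug sensitivity prediction, although high variability alone does not establish that a gene is related to drug response.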

- **Next Steps**: Proceed with dimensionality reduction and visualization to further explore the relationships between genes.
