# Extended Gene Correlation Analysis with IC50

Beyond the per-gene correlation with IC50 (`LN_IC50`), this notebook explores the following:
1. **Correlation Matrix of Genes**: Compute and visualize pairwise correlations between gene expression columns.
2. **Correlation Between PCA Components and IC50**: Analyze how each PCA component correlates with IC50.
3. **Heatmap of Gene-IC50 Correlations**: Plot a heatmap of the correlations among IC50 and the 10 genes most positively correlated with it.
4. **Gene-to-Gene Correlation**: Re-plot the pairwise gene correlation matrix as a standalone heatmap.


In [2]:
import os
import polars as pl
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt


In [None]:
# Load final dataset
final_data = pl.read_parquet("../../data/pseudo_bulk/gdsc_single_cell_aligned.parquet")



In [None]:
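# Select gene columns (assuming columns after the 3rd are genes), excluding the target itself
gene_columns = [col for col in final_data.columns[3:] if col != "LN_IC50"]

# Calculate the Pearson correlation between each gene and IC50, sorted ascending
gene_df = final_data[gene_columns].to_pandas()
ic50 = final_data["LN_IC50"].to_pandas()
correlations = gene_df.corrwith(ic50).sort_values()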

# Save top correlated genes (both positive and negative)
correlations_df = pd.DataFrame({
    "Top Positively Correlated Genes": correlations.tail(10).index.tolist(),
    "Top Negatively Correlated Genes": correlations.head(10).index.tolist()
})

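# Ensure the output directory exists before writing any results
os.makedirs("statistics", exist_ok=True)
correlations_df.to_csv("statistics/gene_ic50_correlations.csv", index=False)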
print("📂 Gene-IC50 correlations saved to 'statistics/gene_ic50_correlations.csv'")

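Sorting signed correlations and taking `head`/`tail` can return weakly correlated genes when few genes are strongly positive or strongly negative. As a minimal sketch, the toy example below ranks genes by absolute Pearson correlation instead. The synthetic data, the gene names (`GENE_A`, `GENE_B`, `GENE_C`), and the helper `top_abs_correlations` are hypothetical and not part of the dataset above.

```python
import numpy as np
import pandas as pd


def top_abs_correlations(features: pd.DataFrame, target: pd.Series, k: int = 10) -> pd.Series:
    """Return the k features with the largest |Pearson r| against target (hypothetical helper)."""
    corr = features.corrwith(target).dropna()
    # Rank by magnitude, but return the signed values so the direction is preserved.
    order = corr.abs().sort_values(ascending=False).index[:k]
    return corr.loc[order]


# Synthetic toy data standing in for gene expression and LN_IC50 (assumption).
rng = np.random.default_rng(0)
ic50 = pd.Series(rng.normal(size=200), name="LN_IC50")
toy = pd.DataFrame({
    "GENE_A": ic50 * 2.0 + rng.normal(scale=0.1, size=200),  # strongly positive
    "GENE_B": -ic50 + rng.normal(scale=0.1, size=200),       # strongly negative
    "GENE_C": rng.normal(size=200),                          # unrelated noise
})

top = top_abs_correlations(toy, ic50, k=2)
print(top)
```

With real data, the same helper could be called on the pandas gene frame and the `LN_IC50` series, provided both share the same row index.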

In [None]:
# Correlation Matrix of Genes
gene_corr = final_data[gene_columns].to_pandas().corr()

# Plot heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(gene_corr, cmap="coolwarm", annot=False, fmt=".2f", linewidths=0.5)
plt.title("Gene-to-Gene Correlation Matrix")
plt.savefig("statistics/gene_correlation_matrix.png")
plt.show()
plt.close()


In [None]:
# Select PCA components (assuming columns start with 'PC').
# Note: a bare 'PC' prefix also matches gene symbols such as PCNA; tighten the filter if needed.
pca_columns = [col for col in final_data.columns if col.startswith('PC')]

# Calculate correlation between PCA components and IC50
pca_corr = final_data[pca_columns].to_pandas().corrwith(final_data["LN_IC50"].to_pandas()).sort_values()

# Display the top 5 most positively and negatively correlated PCA components
print("Top Positive PCA Correlations with IC50:")
print(pca_corr.tail(5))
print("\nTop Negative PCA Correlations with IC50:")
print(pca_corr.head(5))


In [None]:
# Get the top 10 positively correlated genes
top_genes = correlations.tail(10).index.tolist()

# Extract gene expressions for the top correlated genes
gene_ic50_df = final_data[top_genes + ["LN_IC50"]].to_pandas()

# Calculate correlation matrix between the top genes and IC50
gene_ic50_corr = gene_ic50_df.corr()

# Plot heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(gene_ic50_corr, annot=True, cmap="coolwarm", fmt=".2f", linewidths=0.5)
plt.title("Top Gene-IC50 Correlation Heatmap")
plt.savefig("statistics/gene_ic50_correlation_heatmap.png")
plt.show()
plt.close()


In [None]:
# Re-plot the gene-to-gene correlation matrix computed earlier (gene_corr),
# saved under a separate title and filename
plt.figure(figsize=(12, 8))
sns.heatmap(gene_corr, cmap="coolwarm", annot=False, fmt=".2f", linewidths=0.5)
plt.title("Pairwise Gene Correlation Matrix")
plt.savefig("statistics/gene_pairwise_correlation_heatmap.png")
plt.show()
plt.close()


### Conclusion

In addition to the gene-IC50 correlation analysis, we have:
1. Generated a **correlation matrix** of genes to identify redundant features.
2. Explored the **relationship between PCA components and IC50**.
3. Created a **gene-IC50 correlation heatmap** for the 10 genes most positively correlated with IC50.
4. Visualized **gene-to-gene correlations** for better understanding of feature interactions.
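
One way to turn the gene correlation matrix into a list of redundant features is to scan its upper triangle for pairs above a correlation cutoff. The sketch below shows this on synthetic data. The helper `redundant_pairs`, the 0.9 threshold, and the toy gene names are assumptions for illustration, not part of the analysis above.

```python
import numpy as np
import pandas as pd


def redundant_pairs(corr: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """List feature pairs whose |correlation| exceeds threshold (hypothetical helper)."""
    # Indices of the strict upper triangle: each pair once, diagonal excluded.
    rows, cols = np.triu_indices(len(corr), k=1)
    pairs = pd.DataFrame({
        "gene_1": corr.index[rows],
        "gene_2": corr.columns[cols],
        "r": corr.to_numpy()[rows, cols],
    })
    return pairs[pairs["r"].abs() > threshold].reset_index(drop=True)


# Synthetic toy data: GENE_B is a noisy copy of GENE_A, GENE_C is independent.
rng = np.random.default_rng(1)
base = rng.normal(size=300)
toy = pd.DataFrame({
    "GENE_A": base,
    "GENE_B": base + rng.normal(scale=0.05, size=300),
    "GENE_C": rng.normal(size=300),
})

pairs = redundant_pairs(toy.corr(), threshold=0.9)
print(pairs)
```

The threshold is a judgment call: a lower cutoff flags more pairs as redundant, so it is worth checking how many genes would be dropped before committing to a value.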

- **Next Steps**: Proceed with feature variability analysis to identify the most variable genes across the dataset.
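
As a rough sketch of what that variability step could look like, the example below ranks columns by sample variance and keeps the top `k`. The helper `most_variable` and the toy data are hypothetical, and variance is only one possible variability measure.

```python
import numpy as np
import pandas as pd


def most_variable(features: pd.DataFrame, k: int = 10) -> pd.Series:
    """Return the k columns with the highest sample variance (hypothetical helper)."""
    return features.var().sort_values(ascending=False).head(k)


# Synthetic toy data with clearly different spreads per column.
rng = np.random.default_rng(2)
toy = pd.DataFrame({
    "GENE_A": rng.normal(scale=3.0, size=500),
    "GENE_B": rng.normal(scale=1.0, size=500),
    "GENE_C": rng.normal(scale=0.1, size=500),
})

ranked = most_variable(toy, k=2)
print(ranked)
```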
