## Based on Arda's instructions after I tried clustering with K-means

Arda: I think you should not do K-means clustering. The UMAP space is not a good space to do any analysis on; it is just for visualization. In the gene expression space, K-means will not be optimal either, due to the number of features etc. So your strategy should be:

1. Over 100 (or 1000) iterations: subsample both cells and genes to 90%, cluster using Leiden clustering with default parameters, and keep track of which pairs of cells are clustered together.
2. From the 100 (or 1000) iterations, calculate the frequency of pairwise co-occurrence, and do a final hierarchical clustering using silhouette as a metric.
3. Using the identified clusters, create a PAGA graph and subsequently create a UMAP using the PAGA initialization. This third step will give you a nice visualization that overlaps with the clusters identified in steps 1-2.

This will give you a robust clustering.

ChatGPT was used for the initial code outline, which I then modified to reduce the computational cost, e.g. through Arda's idea of taking the dot product of the one-hot encoded cluster labels.
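To make the one-hot trick concrete, here is a minimal toy sketch with five made-up cells and two made-up clusters (the `labels` array is purely illustrative and not taken from the real data). Multiplying the one-hot matrix by its transpose gives a cells-by-cells matrix with 1 wherever two cells share a cluster.

```python
import numpy as np

# Toy labels for 5 cells from one hypothetical clustering run
labels = np.array([0, 0, 1, 1, 0])

# One-hot encode: rows are cells, columns are clusters
one_hot = np.eye(labels.max() + 1, dtype=int)[labels]

# Entry (i, j) is 1 when cells i and j share a cluster, otherwise 0
together = one_hot @ one_hot.T
print(together)
```

Summing these matrices over all iterations gives the pairwise co-occurrence counts without looping over pairs of cells.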



In [3]:
import scanpy as sc
import numpy as np
import pandas as pd
from sklearn.metrics import silhouette_score
from scipy.cluster.hierarchy import linkage, dendrogram

In [4]:

# ... Load your AnnData object (adata) ...

adata = sc.read_h5ad("adata_filtered_combined_feb2025.h5ad")

In [None]:


n_iterations = 100  # Arda suggested 100 to 1000 iterations for robustness
n_cells = adata.n_obs  # avoids hard-coding the number of cells
pairwise_cooccurrence = np.zeros((n_cells, n_cells))


for i in range(n_iterations):
    # Subsample cells and genes (90%); integer positions let us map counts back to the full matrix
    cell_idx = np.sort(np.random.choice(n_cells, size=int(0.9 * n_cells), replace=False))
    gene_idx = np.sort(np.random.choice(adata.n_vars, size=int(0.9 * adata.n_vars), replace=False))
    adata_subsampled = adata[cell_idx, gene_idx].copy()

    # Leiden clustering (default parameters)
    sc.pp.neighbors(adata_subsampled)
    sc.tl.leiden(adata_subsampled)

    # Track pairwise co-occurrence using one-hot encoding (rows: cells, columns: clusters)
    one_hot = pd.get_dummies(adata_subsampled.obs['leiden']).to_numpy(dtype=np.float32)
    # The dot product is 1 where two cells share a cluster in this iteration, 0 otherwise
    together = one_hot @ one_hot.T
    # Add to the existing counts, only at the rows/columns of the cells sampled this iteration
    pairwise_cooccurrence[np.ix_(cell_idx, cell_idx)] += together


In [None]:
# 2. Frequency calculation and hierarchical clustering

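import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import fcluster
from scipy.spatial.distance import squareform

# Convert to a distance matrix (1 - normalized co-occurrence).
# Each iteration only sees 90% of the cells, so frequencies stay somewhat below 1.
cooccurrence_matrix_normalized = np.asarray(pairwise_cooccurrence) / n_iterations
distance_matrix = 1 - cooccurrence_matrix_normalized
np.fill_diagonal(distance_matrix, 0)  # a cell is at distance 0 from itself

# linkage() expects a condensed distance vector; a square matrix would be read as raw observations
condensed_distances = squareform(distance_matrix, checks=False)
# Average linkage accepts arbitrary precomputed distances; Ward assumes Euclidean distances
# (can experiment with other methods)
linkage_matrix = linkage(condensed_distances, method='average')

# The silhouette score measures cluster quality, so it cannot be the linkage metric itself.
# It can be calculated on the final clusters after hierarchical clustering.

# Choose the number of clusters by looking at the dendrogram (below), or by calculating
# the silhouette score for different numbers of clusters and keeping the best one.
dendrogram(linkage_matrix, no_labels=True)
plt.show()

# Cut the dendrogram to get a certain number of clusters
cluster_labels = fcluster(linkage_matrix, t=4, criterion='maxclust')  # Example: 4 clusters
# Stored as a categorical so scanpy treats the labels as groups
adata.obs['consensus_cluster'] = pd.Categorical(cluster_labels.astype(str))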
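
The remaining pieces of Arda's strategy (choosing the number of clusters by silhouette, then step 3 with PAGA and UMAP) could be sketched roughly as below. This is only a sketch under assumptions: the helper names `pick_n_clusters` and `paga_initialized_umap`, the candidate range of 2 to 10 clusters, and the use of average linkage are my own choices, not something Arda specified. The toy matrix at the bottom is made up just to check that the selection behaves sensibly.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from sklearn.metrics import silhouette_score


def pick_n_clusters(distance_matrix, candidates=range(2, 11)):
    """Return (best_k, scores) for average-linkage cuts scored by silhouette."""
    condensed = squareform(distance_matrix, checks=False)
    Z = linkage(condensed, method="average")
    scores = {}
    for k in candidates:
        labels = fcluster(Z, t=k, criterion="maxclust")
        if len(np.unique(labels)) < 2:
            continue  # silhouette needs at least two clusters
        # metric="precomputed" tells sklearn the input is already a distance matrix
        scores[k] = silhouette_score(distance_matrix, labels, metric="precomputed")
    best_k = max(scores, key=scores.get)
    return best_k, scores


def paga_initialized_umap(adata, cluster_key="consensus_cluster"):
    """Step 3: PAGA graph on the consensus clusters, then a PAGA-initialized UMAP."""
    import scanpy as sc  # imported here because only this helper needs scanpy

    # PAGA expects a categorical grouping column
    adata.obs[cluster_key] = adata.obs[cluster_key].astype(str).astype("category")
    sc.pp.neighbors(adata)  # neighbor graph on the full (not subsampled) data
    sc.tl.paga(adata, groups=cluster_key)
    sc.pl.paga(adata, plot=False)  # computes the node positions used for initialization
    sc.tl.umap(adata, init_pos="paga")
    return adata


# Toy check: two well-separated blocks of three cells each
toy = np.full((6, 6), 0.9)
toy[:3, :3] = 0.1
toy[3:, 3:] = 0.1
np.fill_diagonal(toy, 0.0)
best_k, scores = pick_n_clusters(toy, candidates=range(2, 5))
print(best_k, scores)
```

On the real data this would be called as `pick_n_clusters(distance_matrix)`, with the resulting `best_k` passed as `t` to `fcluster`, followed by `paga_initialized_umap(adata)` and `sc.pl.umap(adata, color='consensus_cluster')`. With roughly 20,000 cells the silhouette computation on the full distance matrix may be slow, so scoring a random subset of cells is worth considering.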
