## Portfolio Assignment week 02 (k-means & spectral clustering)
### Manifold learning

K-means and spectral clustering are popular algorithms used for clustering analysis. K-means is a simple and efficient method that aims to partition data points into K distinct clusters based on their similarity. It iteratively assigns each data point to the cluster with the nearest mean and updates the cluster centers until convergence. K-means is widely used due to its scalability and ease of implementation.
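The assign-and-update loop described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not scikit-learn's implementation: the function name `kmeans_sketch`, the random initialisation and the toy data are assumptions made for the example.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Minimal Lloyd-style K-means: assign points, update centers, repeat."""
    rng = np.random.default_rng(seed)
    # Initialise the centers with k distinct data points chosen at random
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: squared distance from every point to every center
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each center to the mean of its assigned points
        # (an empty cluster keeps its previous center)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Stop once the centers no longer move
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Toy data: two well-separated blobs (hypothetical, not the liver data)
rng = np.random.default_rng(1)
toy = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])
toy_labels, toy_centers = kmeans_sketch(toy, k=2)
print(toy_labels)
```

In practice `sklearn.cluster.KMeans` is preferable, since it adds a better initialisation (k-means++) and multiple restarts.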

On the other hand, spectral clustering is a graph-based clustering algorithm that leverages the spectral properties of the data. It builds a similarity graph from the data points, embeds them in a lower-dimensional space using the eigenvectors of the graph Laplacian, and then clusters the embedded points, typically with K-means. Because it relies on connectivity in the graph rather than on distance to a cluster center, spectral clustering can capture complex relationships and handle data that is not linearly separable.
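To make the graph and eigenvector steps more concrete, here is a rough sketch on toy data. It is one common variant (Gaussian affinity, symmetric normalised Laplacian, K-means on the embedding) and not necessarily what `SpectralClustering` does internally; the kernel width `gamma` and the ring-shaped toy data are assumptions chosen for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_circles
from sklearn.metrics import pairwise_distances

# Toy data: two concentric rings, which no straight line can separate
X_toy, y_toy = make_circles(n_samples=200, factor=0.4, noise=0.03, random_state=0)

# 1. Similarity graph: Gaussian (RBF) affinity between all pairs of points
gamma = 20.0  # assumed kernel width, tuned by hand for this toy data only
W = np.exp(-gamma * pairwise_distances(X_toy) ** 2)
np.fill_diagonal(W, 0.0)  # no self-loops

# 2. Symmetric normalised graph Laplacian: L = I - D^{-1/2} W D^{-1/2}
d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
L = np.eye(len(X_toy)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]

# 3. Spectral embedding: eigenvectors belonging to the k smallest eigenvalues
k = 2
eigvals, eigvecs = np.linalg.eigh(L)  # eigh returns eigenvalues in ascending order
embedding = eigvecs[:, :k]
embedding = embedding / np.linalg.norm(embedding, axis=1, keepdims=True)

# 4. Cluster the embedded points with ordinary K-means
spec_labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(embedding)
print(np.bincount(spec_labels))
```

The result depends strongly on `gamma`: a value that is too small connects everything, while a value that is too large leaves almost no edges in the graph.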

Both K-means and spectral clustering have their strengths and are applied in various domains, depending on the nature and structure of the data.

I'll check which one is better at clustering the liver cancer data.
This data set contains gene expression levels for two sample types: hepatocellular carcinoma (HCC) and normal samples.

In [1]:
# needed modules
import pandas as pd
import yaml
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import seaborn as sns 
from sklearn.cluster import KMeans
from sklearn.cluster import SpectralClustering


## Steps
The steps we are going to take to compare the two clustering methods:
- loading the data
- checking NaN values / distribution / correlation
- K-means and spectral clustering
- conclusion
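The checks in the second step (NaN values, distribution, correlation) could look roughly like the sketch below. It uses a tiny made-up table instead of the liver data; the frame `demo` and the probe names are hypothetical. On the real dataset, with more than 22,000 columns, the correlation would normally be computed on a subset of probes only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the expression table: 6 samples, 4 made-up probes
rng = np.random.default_rng(0)
demo = pd.DataFrame(
    rng.normal(5, 1, size=(6, 4)),
    columns=["probe_a", "probe_b", "probe_c", "probe_d"],
)
demo.loc[2, "probe_b"] = np.nan  # inject one missing value for the demo

# NaN check: count missing values per column, keep only columns that have any
na_counts = demo.isna().sum()
print(na_counts[na_counts > 0])

# Distribution: summary statistics per probe
print(demo.describe().T[["mean", "std", "min", "max"]])

# Correlation: pairwise Pearson correlation between probes
corr = demo.corr()
print(corr.round(2))
```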

In [2]:
# importing the data with yaml safe_load
with open("config.yml") as config:
    input_files = yaml.safe_load(config)
    df = pd.read_csv(input_files["liver"])

df.head()

Unnamed: 0,samples,type,1007_s_at,1053_at,117_at,121_at,1255_g_at,1294_at,1316_at,1320_at,...,AFFX-r2-Ec-bioD-3_at,AFFX-r2-Ec-bioD-5_at,AFFX-r2-P1-cre-3_at,AFFX-r2-P1-cre-5_at,AFFX-ThrX-3_at,AFFX-ThrX-5_at,AFFX-ThrX-M_at,AFFX-TrpnX-3_at,AFFX-TrpnX-5_at,AFFX-TrpnX-M_at
0,GSM362958.CEL.gz,HCC,6.801198,4.553189,6.78779,5.430893,3.250222,6.272688,3.413405,3.37491,...,10.735084,10.398843,12.298551,12.270505,3.855588,3.148321,3.366087,3.199008,3.160388,3.366417
1,GSM362959.CEL.gz,HCC,7.585956,4.19354,3.763183,6.003593,3.309387,6.291927,3.754777,3.587603,...,11.528447,11.369919,12.867048,12.560433,4.016561,3.282867,3.541994,3.54868,3.460083,3.423348
2,GSM362960.CEL.gz,HCC,7.80337,4.134075,3.433113,5.395057,3.476944,5.825713,3.505036,3.687333,...,10.89246,10.416151,12.356337,11.888482,3.839367,3.598851,3.516791,3.484089,3.282626,3.512024
3,GSM362964.CEL.gz,HCC,6.92084,4.000651,3.7545,5.645297,3.38753,6.470458,3.629249,3.577534,...,10.686871,10.524836,12.006596,11.846195,3.867602,3.180472,3.309547,3.425501,3.166613,3.377499
4,GSM362965.CEL.gz,HCC,6.55648,4.59901,4.066155,6.344537,3.372081,5.43928,3.762213,3.440714,...,11.014454,10.775566,12.657182,12.573076,4.09144,3.306729,3.493704,3.205771,3.378567,3.392938


In [3]:
# Examine the data properties 
amount_na= {}
# check shape
print(f"columns in the dataframe:{df.shape[1]}. Amount of rows {df.shape[0]}.")
# print(df.dtypes) # comment out because of cluttering in console
# check NAN
for col in df.columns:
    if df[col].isna().sum() > 0:
        amount_na[col] = df[col].isna().sum()
sorted_amount_na = sorted(amount_na.items(), key=lambda x:x[1], reverse=True)
print(f"the columns with NAN values:{sorted_amount_na}")
print(df["type"].unique(), "<- The groups in the data")


columns in the dataframe:22279. Amount of rows 357.
the columns with NAN values:[]
['HCC' 'normal'] <- The groups in the data


# K-means and spectral clustering
The CSV file comprises 357 samples (rows) and 22277 gene expression levels (columns), so the dataframe is quite large. The dataset has two distinct categories (column "type"): HCC and normal.
There are no NaN values in any of the columns, which is a plus.
The sample names are removed because they are no longer required.
The values of the "type" column are stored in a separate list so they can be used later, for example to color plots.


In [4]:
types = list(df["type"])
X = df.drop(columns=["samples","type"])

# Scaling and clustering the data (spectral clustering)
We can now check the performance of spectral clustering with the silhouette score.
The Silhouette Coefficient is computed with the silhouette_score function from scikit-learn.

In [11]:
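from sklearn.metrics import silhouette_score

# Standardize every gene to zero mean and unit variance
scaler = StandardScaler()
X = scaler.fit_transform(X)

# Spectral clustering (note: n_clusters=3, although the data has two known types)
clustering = SpectralClustering(n_clusters=3, assign_labels='discretize', random_state=42)
labels = clustering.fit_predict(X)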

# Calculate the Silhouette Coefficient
silhouette_avg = silhouette_score(X, labels)
print("Silhouette Coefficient:", silhouette_avg)


Silhouette Coefficient: -0.00403883763589774




# Conclusion about the spectral clustering.
The fit_predict method clusters the data and returns the predicted labels. The silhouette_score function then calculates the average Silhouette Coefficient over the entire dataset. The Silhouette Coefficient ranges from -1 to 1: values close to 1 indicate compact, well-separated clusters, values around 0 indicate overlapping clusters, and negative values suggest that samples may have been assigned to the wrong cluster.
In our case the score is very close to zero and slightly negative, meaning that the clusters found by spectral clustering overlap heavily.
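To see where such a score comes from, the Silhouette Coefficient can be computed by hand on a tiny example. For each sample, `a` is the mean distance to the other points in its own cluster and `b` is the mean distance to the points of the nearest other cluster; the sample's score is `(b - a) / max(a, b)`. The helper `silhouette_by_hand` and the four toy points are made up for illustration, and the sketch assumes every cluster has at least two points.

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Four 1-D points in two obvious clusters (toy values, not from the liver data)
pts = np.array([[0.0], [1.0], [10.0], [11.0]])
lab = np.array([0, 0, 1, 1])

def silhouette_by_hand(X, labels):
    """Average of (b - a) / max(a, b) over all samples."""
    scores = []
    for i in range(len(X)):
        dist = np.linalg.norm(X - X[i], axis=1)
        same = labels == labels[i]
        same[i] = False  # exclude the point itself
        a = dist[same].mean()  # mean distance within its own cluster
        # mean distance to the nearest other cluster
        b = min(dist[labels == c].mean() for c in set(labels.tolist()) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

by_hand = silhouette_by_hand(pts, lab)
from_sklearn = silhouette_score(pts, lab)
print(round(by_hand, 4), round(from_sklearn, 4))  # both print 0.8997
```

The two clusters here are far apart relative to their size, so the score is close to 1, in contrast to the near-zero value obtained above.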

# Scaling and clustering the data (K-means)
Now let's see how well K-means clusters the data.


In [12]:
# Perform K-means clustering
kmeans = KMeans(n_clusters=3, random_state=42)
labels = kmeans.fit_predict(X) 

# Calculate the Silhouette Coefficient
silhouette_avg = silhouette_score(X, labels)
print("Silhouette Coefficient:", silhouette_avg)



Silhouette Coefficient: 0.10076285580932225
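The silhouette score only measures how compact and separated the clusters are. Since the known sample types were stored in `types` earlier, the cluster labels could additionally be compared against them, for example with the adjusted Rand index from scikit-learn. The sketch below uses made-up labels to show the idea; it is not a result for the liver data.

```python
from sklearn.metrics import adjusted_rand_score

# Made-up ground truth and two hypothetical clusterings of six samples
known = ["HCC", "HCC", "HCC", "normal", "normal", "normal"]
perfect = [1, 1, 1, 0, 0, 0]  # matches the types (cluster numbering is irrelevant)
mixed = [0, 1, 0, 1, 0, 1]    # unrelated to the types

ari_perfect = adjusted_rand_score(known, perfect)
ari_mixed = adjusted_rand_score(known, mixed)
print("perfect:", ari_perfect, "mixed:", ari_mixed)
```

An adjusted Rand index of 1 means the clustering reproduces the known groups exactly, while values around 0 mean it is no better than random assignment. On the notebook's data the call would be `adjusted_rand_score(types, labels)`.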


# Conclusion about the k-means clustering.
K-means and spectral clustering have distinct properties that can make one method more appropriate than the other, depending on the data and its underlying structure. K-means is known for its simplicity and efficiency, which makes it useful for big datasets with well-separated, roughly spherical clusters. It positions the cluster centroids so as to minimize the within-cluster sum of squares. Spectral clustering, on the other hand, is useful for complex data with non-linear or non-convex cluster shapes. It captures complicated relationships by using a graph representation of the data and its spectral properties. However, spectral clustering is sensitive to its hyperparameters and requires constructing and decomposing a similarity graph, which makes it computationally more expensive.
On this dataset K-means reaches a higher Silhouette Coefficient (about 0.10) than spectral clustering (about -0.004), so K-means outperforms spectral clustering here, although a score of 0.10 still indicates only weakly separated clusters. The size of the data (more than 22,000 features) likely plays a role, since spectral clustering is sensitive to how the similarity graph is constructed.
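Both runs above use n_clusters=3, while the data has two known types. One way to check whether another number of clusters fits better is to compute the silhouette score for several values of k. The sketch below does this for K-means on synthetic two-blob data; `make_blobs`, the blob centers and the range of k are assumptions for the example, and the same loop could wrap `SpectralClustering` instead.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic data with two clearly separated groups (not the liver data)
X_demo, _ = make_blobs(
    n_samples=200, centers=[[0, 0], [10, 10]], cluster_std=1.0, random_state=0
)
X_demo = StandardScaler().fit_transform(X_demo)

# Silhouette score for each candidate number of clusters
scores = {}
for k in (2, 3, 4):
    demo_labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, demo_labels)

best_k = max(scores, key=scores.get)
print(scores, "-> best k:", best_k)
```

On this synthetic data the highest score is obtained for k=2, which matches the number of groups that were generated.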