# Cluster Analysis using K-Means Clustering

For the evaluation chapter of our thesis, we collected two sets of 100 ML repositories each. The first set contains randomly selected repositories hosted on GitHub. The second consists of randomly selected ML experiments published on "paperswithcode" that link to their corresponding code repository. The repositories had to meet only two requirements: the code is hosted on GitHub, and the source code is mainly stored in ".py" or ".ipynb" files.

We evaluated each set using the ML Repository Reproducibility Analysis Tool we developed. The tool calculates a score for each of the detectable reproducibility factors defined in the thesis. The input for the cluster analysis presented here is a .csv file containing the factor scores of both sets, together with a label indicating which of the two sets each repository belongs to.

Before presenting the process and the results of this analysis, we explain why we chose this approach. We are interested in whether a clustering algorithm can recognize distinct groups of repositories based on the calculated scores. We also want to show what insights a suitable visualization of the resulting clusters provides.

The implementation used here was adapted from the example shown at https://towardsdatascience.com/clustering-with-more-than-two-features-try-this-to-explain-your-findings-b053007d680a.

## Explaining K-Means Clustering

An unsupervised learning algorithm is well suited to evaluating the data collected by our tool, since no target label is required. Clustering is the attempt to define groups within a set of entities such that entities in the same group share key characteristics. K-means is an iterative algorithm that assigns each entity to the cluster with the closest mean (centroid) and then recomputes each centroid from its assigned entities. The "k" stands for the number of clusters chosen in advance, and "means" refers to the centroids. In this way, previously unknown groups in the data can be identified.
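The assign-and-update loop described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function names `assign` and `kmeans`, the toy `points` array and the plain random initialization are ours and are not part of the notebook, which uses scikit-learn's `KMeans` with `k-means++` initialization.

```python
import numpy as np


def assign(points, centroids):
    """Return, for every point, the index of the closest centroid."""
    # Squared Euclidean distance of every point to every centroid.
    dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)


def kmeans(points, k, n_iter=100, seed=0):
    """Illustrative k-means loop: alternate assignment and centroid update."""
    rng = np.random.default_rng(seed)
    # Assumption: start from k distinct data points picked at random
    # (simpler than the k-means++ initialization used in the notebook).
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        labels = assign(points, centroids)
        # Move each centroid to the mean of its assigned points;
        # keep the old centroid if a cluster ends up empty.
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving
        centroids = new_centroids
    return assign(points, centroids), centroids


# Toy data (hypothetical): two well-separated groups of three points each.
points = np.array([
    [0.1, 0.0], [0.2, 0.1], [0.0, 0.2],
    [0.9, 1.0], [1.0, 0.9], [0.8, 1.0],
])
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

On this toy input the first three points end up in one cluster and the last three in the other. Which of the two receives label 0 depends on the random initialization.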

## Advantages and Limitations of this Method (for our case)

In our case, this approach has several advantages. It is easier to implement than the alternatives and requires only one significant parameter, the number of clusters. In addition, K-means performs so-called "hard" clustering, which means that each entity is assigned to exactly one cluster (https://link.springer.com/content/pdf/10.1007/s40471-019-00211-7.pdf), which simplifies the interpretation of the results. Disadvantages of this approach are the relatively high impact of outliers on the result and the fact that certain patterns cannot be mapped well. In our case, however, these disadvantages are negligible.

## Determining the Number of Clusters using the Elbow Method

First it has to be determined how many clusters the algorithm should have available for classifying the entities. There are different strategies to determine an optimal number of clusters (https://www.researchgate.net/profile/Trupti-Kodinariya/publication/313554124_Review_on_Determining_of_Cluster_in_K-means_Clustering/links/5789fda408ae59aa667931d2/Review-on-Determining-of-Cluster-in-K-means-Clustering.pdf).

We chose the Elbow Method because it can be performed intuitively by visually inspecting a graph of inertia against k. The elbow point is the k after which adding a further cluster (moving from k to k + 1) reduces the inertia significantly less than at earlier points.
Inertia measures how internally coherent the clusters are: it is the sum of squared distances between each entity and the centroid of its cluster, so lower values indicate more compact clusters. The K-means algorithm aims to choose centroids (the mean of the entities in a cluster) that minimize this inertia.
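For reference, the inertia can be written out as follows. The notation is ours and is not taken from the thesis: $C_j$ denotes the set of entities in cluster $j$, $x_i$ the score vector of entity $i$, and $\mu_j$ the centroid of cluster $j$. This corresponds to the within-cluster sum of squared distances that scikit-learn reports as `inertia_`.

```latex
\text{inertia} = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
```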

In some cases this point can be determined very clearly. In our case, the graph shows the largest decreases in inertia at k = 2, k = 6 and k = 8 (each compared to k - 1). We choose k = 2 because this value is consistent with the elbow method and is also useful for assessing the hypothesis of the thesis: with two clusters, one can see how many entities of the GitHub and "paperswithcode" sets are assigned to each cluster.
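The comparison of each k with k - 1 can also be done numerically rather than by eye. The following sketch uses made-up inertia values purely for illustration. They are not the values measured in this notebook, and the names `drops` and `elbow_candidate` are our own.

```python
# Hypothetical inertia values for k = 1..6 (illustrative only, not the thesis data).
inertia = [120.0, 70.0, 60.0, 52.0, 46.0, 42.0]

# Decrease in inertia gained by going from k - 1 to k clusters, keyed by k.
drops = {k: inertia[k - 2] - inertia[k - 1] for k in range(2, len(inertia) + 1)}

# The k with the largest decrease is a natural elbow candidate.
elbow_candidate = max(drops, key=drops.get)

for k, drop in drops.items():
    print(f"k = {k}: inertia decreases by {drop:.1f} compared to k = {k - 1}")
print("elbow candidate:", elbow_candidate)
```

In the notebook itself, the list `inertia` filled by the elbow-method cell could be passed through the same two lines to rank the decreases.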

## Import Dataset

In [27]:
import pandas as pd
df=pd.read_csv("cluster_analysis_input_set.csv")
print(df)

     SCAD_score  SE_score  DSA_score  RSC_score  MS_score  HPL_score  \
0          0.40      0.20          1        1.0         1          0   
1          0.08      0.20          0        0.0         1          0   
2          0.13      0.20          0        0.0         1          0   
3          0.19      0.20          0        0.0         1          0   
4          0.20      0.20          1        0.0         1          0   
..          ...       ...        ...        ...       ...        ...   
195        0.44      1.00          1        0.0         0          0   
196        0.90      0.56          1        1.0         0          0   
197        0.65      0.13          0        1.0         1          0   
198        0.68      0.20          1        1.0         1          0   
199        0.91      0.91          1        0.0         1          1   

       from_set  
0        github  
1        github  
2        github  
3        github  
4        github  
..          ...  
195  scie

## The Elbow Method

In [28]:
from sklearn.cluster import KMeans
import plotly.graph_objects as go
import plotly.express as px
import numpy as np
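# use only the factor scores as features; the set label is not clustered on
X = df.drop("from_set", axis=1)
inertia = []
# fit K-means for k = 1..10 and record the inertia of each fit
for i in range(1, 11):
    kmeans = KMeans(
        n_clusters=i, init="k-means++",
        n_init=10,
        tol=1e-04, random_state=42
    )
    kmeans.fit(X)
    inertia.append(kmeans.inertia_)
fig_inertia = go.Figure(data=go.Scatter(x=np.arange(1, 11), y=inertia))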
fig_inertia.update_layout(xaxis=dict(range=[0,11],title="Cluster Number"),
                  yaxis={'title':'Inertia'},
                 annotations=[
        dict(
            x=2,
            y=inertia[1],
            xref="x",
            yref="y",
            text="Elbow",
            showarrow=True,
            arrowhead=7,
            ax=20,
            ay=-40
        )
    ])

fig_inertia.show(renderer="png")

## Computing & Visualizing the Clusters

In this polar line plot, each cluster is drawn as a closed polyline in polar coordinates. Each vertex of the polyline is the cluster's mean score for one factor.

In [29]:
kmeans = KMeans(
        n_clusters=2, init="k-means++",
        n_init=10,
        tol=1e-04, random_state=42
    )
kmeans.fit(X)
clusters=pd.DataFrame(X,columns=df.drop("from_set",axis=1).columns)
clusters['label']=kmeans.labels_
polar=clusters.groupby("label").mean().reset_index()
polar=pd.melt(polar,id_vars=["label"])
fig_polar = px.line_polar(polar, r="value", theta="variable", color="label", line_close=True,height=800,width=1400)
fig_polar.show(renderer="png")

## Cluster Characteristics

### Table with Measured Mean Factor Scores for Each Cluster 

|   | SCAD  | SE | DSA | RSC | MS | HPL |
|---|---|---|---|---|---|---|
| Cluster 0 | 0.63 | 0.41 | 0.82 | 0.99 | 0.66 | 0.09 |
| Cluster 1 | 0.44 | 0.34 | 0.62 | 0 | 0.39 | 0.06 |

Based on this, the key difference appears to be the score for random seed control (RSC). Notable differences can also be observed for the source code analysis & documentation (SCAD), data set availability (DSA) and model serialisation (MS) factors. The software environment (SE) score is almost equal, with a slight tendency towards Cluster 0. Hyperparameter logging (HPL) does not appear to be relevant for the separation; however, this score was 0 for 84% of all entities.

Overall, one can observe that the mean score of every factor is higher in Cluster 0. This indicates that the entities in Cluster 0 comply better with the reproducibility guidelines than those in Cluster 1.
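The size of each gap can be read directly off the table. As a small worked example, the following sketch ranks the factors by the difference between the two cluster means. The values are copied from the table above, while the variable names are ours.

```python
# Mean factor scores per cluster, copied from the table above.
cluster_0 = {"SCAD": 0.63, "SE": 0.41, "DSA": 0.82, "RSC": 0.99, "MS": 0.66, "HPL": 0.09}
cluster_1 = {"SCAD": 0.44, "SE": 0.34, "DSA": 0.62, "RSC": 0.0, "MS": 0.39, "HPL": 0.06}

# Difference of the means (Cluster 0 minus Cluster 1), rounded to two decimals.
gaps = {factor: round(cluster_0[factor] - cluster_1[factor], 2) for factor in cluster_0}

# Factors ordered from the largest to the smallest gap.
ranked = sorted(gaps, key=gaps.get, reverse=True)

for factor in ranked:
    print(f"{factor}: {gaps[factor]:+.2f}")
```

The ranking puts RSC first by a wide margin and HPL last, and every gap is positive, which matches the observations above.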

## How many Entities are in the Clusters

We found that 58% of the entities in the total dataset are in Cluster 0 (116 of 200). The remaining 42% (84 of 200) are in Cluster 1.

Additionally, we are interested in how many of the entities in each cluster belong to the set collected from GitHub or paperswithcode. We cover this in the last section of this notebook.

In [30]:
pie=clusters.groupby('label').size().reset_index()
pie.columns=['label','value']
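# color_discrete_sequence sets the slice colors; color= expects a data column, not color names
fig_pie = px.pie(pie, values='value', names='label', color_discrete_sequence=['blue', 'red'])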
fig_pie.show(renderer="png")

## Saving the result to .csv

In [31]:
clusters.to_csv('clusters.csv', encoding='utf-8')

df2=pd.read_csv("clusters.csv")
print(df2)

     Unnamed: 0  SCAD_score  SE_score  DSA_score  RSC_score  MS_score  \
0             0        0.40      0.20          1        1.0         1   
1             1        0.08      0.20          0        0.0         1   
2             2        0.13      0.20          0        0.0         1   
3             3        0.19      0.20          0        0.0         1   
4             4        0.20      0.20          1        0.0         1   
..          ...         ...       ...        ...        ...       ...   
195         195        0.44      1.00          1        0.0         0   
196         196        0.90      0.56          1        1.0         0   
197         197        0.65      0.13          0        1.0         1   
198         198        0.68      0.20          1        1.0         1   
199         199        0.91      0.91          1        0.0         1   

     HPL_score  label  
0            0      0  
1            0      1  
2            0      1  
3            0      1  
4  

## Saving the figures to /images

In [32]:
import os

# create the output directory if it does not exist yet
os.makedirs("images", exist_ok=True)

# elbow method figure
fig_inertia.write_image("images/fig_inertia.pdf")
# polar line figure
fig_polar.write_image("images/fig_polar.pdf")
# entity-cluster-distribution
fig_pie.write_image("images/fig_pie.pdf")



## Findings

### Random GitHub set

44/100 in Cluster 0. 

56/100 in Cluster 1.

The distribution within this subset is almost even, with a slight tendency towards Cluster 1.

### Random scientific set

72/100 in Cluster 0.

28/100 in Cluster 1.

Within this subset there seems to be a tendency towards Cluster 0.
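The per-set counts above amount to a cross-tabulation of set membership against cluster label. The following sketch rebuilds that tabulation from the reported counts using only the standard library. The row data is reconstructed from the numbers above rather than read from `clusters.csv`, and the set label `"paperswithcode"` is a placeholder we chose for the scientific set.

```python
from collections import Counter

# Rows of (set label, cluster label), rebuilt from the counts reported above.
# Assumption: "paperswithcode" is a placeholder name for the scientific set.
rows = (
    [("github", 0)] * 44 + [("github", 1)] * 56
    + [("paperswithcode", 0)] * 72 + [("paperswithcode", 1)] * 28
)

# Count how often each (set, cluster) combination occurs.
counts = Counter(rows)

for source in ("github", "paperswithcode"):
    total = counts[(source, 0)] + counts[(source, 1)]
    share_0 = counts[(source, 0)] / total
    print(f"{source}: {counts[(source, 0)]}/{total} in Cluster 0 ({share_0:.0%})")

# Sanity check against the overall cluster sizes reported earlier.
cluster_0_total = counts[("github", 0)] + counts[("paperswithcode", 0)]
cluster_1_total = counts[("github", 1)] + counts[("paperswithcode", 1)]
print("Cluster 0:", cluster_0_total, "Cluster 1:", cluster_1_total)
```

The totals add up to the 116 and 84 entities reported for Cluster 0 and Cluster 1. With the notebook's DataFrames, `pd.crosstab(df["from_set"], clusters["label"])` would be one way to obtain the same table directly.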