# Optimizing the Number of Clusters

Selecting the number of clusters is a crucial step in cluster analysis, because it affects both the interpretability and the quality of the results. Several techniques can estimate the number of clusters in a graph. Here, we focus on three popular methods that are commonly used with spectral clustering:

1. **Eigen-Gap Heuristic**
2. **Elbow Method**
3. **Silhouette Score**
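
As an illustration of the first method, the eigen-gap heuristic can be sketched with NumPy alone: compute the smallest eigenvalues of the normalized graph Laplacian and take the position of the largest jump between consecutive eigenvalues as the estimated number of clusters. This is a minimal sketch under stated assumptions, not the implementation used by `infre`. The function name `eigen_gap_k`, its `max_k` parameter, and the toy adjacency matrix are hypothetical and exist only for this example.

```python
import numpy as np


def eigen_gap_k(adjacency, max_k=10):
    """Estimate a cluster count from the largest gap among the
    smallest eigenvalues of the symmetric normalized Laplacian.

    Hypothetical helper for illustration; assumes a symmetric,
    non-negative adjacency (or similarity) matrix.
    """
    A = np.asarray(adjacency, dtype=float)
    degrees = A.sum(axis=1)

    # D^{-1/2}, with isolated nodes (degree 0) mapped to 0.
    with np.errstate(divide="ignore"):
        d_inv_sqrt = np.where(degrees > 0, 1.0 / np.sqrt(degrees), 0.0)

    # Symmetric normalized Laplacian: L = I - D^{-1/2} A D^{-1/2}
    laplacian = np.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

    # eigvalsh returns eigenvalues in ascending order.
    eigenvalues = np.linalg.eigvalsh(laplacian)[: max_k + 1]

    # The gap after the k-th smallest eigenvalue suggests k clusters.
    gaps = np.diff(eigenvalues)
    return int(np.argmax(gaps)) + 1


# Toy graph: three disconnected 4-node cliques.
A = np.zeros((12, 12))
for i in range(3):
    A[4 * i: 4 * i + 4, 4 * i: 4 * i + 4] = np.ones((4, 4)) - np.eye(4)

print(eigen_gap_k(A))
```

On this toy graph the Laplacian has three eigenvalues at zero, one per disconnected clique, followed by a clear jump, so the heuristic reports three clusters. On real similarity graphs the gap is usually less pronounced, which is one reason the elbow method and the silhouette score are often consulted alongside it.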

In [1]:
from infre.preprocess.collection import Collection

path = "collections/CF/docs"
# load queries, relevant documents, collection
queries, rels = Collection.load_qd("collections/CF")
col = Collection(path).create(first=-1)


Collection Done! 1239 documents were parsed.


In [None]:
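from infre.models.cgsb import ConGSB
from numpy import mean
from time import time

# Sweep the similarity condition from 0.1 to 0.9 in steps of 0.1.
for sim in range(1, 10):
    threshold = sim / 10
    print(f"Iteration {sim} with Similarity: {threshold}")
    start_time = time()

    # Build the model with eigen-gap cluster optimization enabled.
    cgsb_model = ConGSB(col, clusters=50, cond={'sim': threshold}, cluster_optimization="eigen_gap")

    # Evaluate on all queries and report mean precision and recall.
    pre, rec = cgsb_model.fit_evaluate(queries, rels)
    print(f'CGSB: {mean(pre):.3f}, {mean(rec):.3f}')

    # Report the graph size (nodes, edges), then persist the results.
    print(cgsb_model.graph.number_of_nodes(), cgsb_model.graph.number_of_edges())
    cgsb_model.save_results(pre, rec)
    print(f"Time: {time() - start_time:.2f}s")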


Iteration 1 with Similarity: 0.1
Cluster optimization is set to eigen_gap
Cluster optimization is enabled.
9620
0.009145258608772217 % pruning. 172 edges were pruned out of 1880756.
Query 1/100: len = 7, frequent = 85
=> Query 1/100, precision = 0.387, recall = 0.515
Query 2/100: len = 11, frequent = 250
=> Query 2/100, precision = 0.075, recall = 0.571
Query 3/100: len = 7, frequent = 100
=> Query 3/100, precision = 0.169, recall = 0.512
Query 4/100: len = 5, frequent = 24
=> Query 4/100, precision = 0.093, recall = 0.556
Query 5/100: len = 3, frequent = 7
=> Query 5/100, precision = 0.098, recall = 0.504
Query 6/100: len = 13, frequent = 287
=> Query 6/100, precision = 0.145, recall = 0.521
Query 7/100: len = 9, frequent = 100
=> Query 7/100, precision = 0.130, recall = 0.518
Query 8/100: len = 7, frequent = 37
=> Query 8/100, precision = 0.030, recall = 0.523
Query 9/100: len = 7, frequent = 51
=> Query 9/100, precision = 0.138, recall = 0.550
Query 10/100: len = 6, frequent = 63
=>