This version treats the soft-clustering matrix as a weighted bipartite graph and partitions it with bipartite spectral graph partitioning (Dhillon, 2001).

**Cons** Slightly lower NMI values than the embedding approach.

**Pros** Scales to larger datasets (e.g. Reuters).

```
Dhillon, I. S. (2001, August). Co-clustering documents and words using bipartite spectral graph partitioning. In Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 269-274).
```

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.140.3011
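
As a rough illustration of the idea in the reference above, the following sketch implements only the two-way case with plain NumPy. The function name `bipartite_spectral_bipartition`, the toy matrix, and the minimal 1-D 2-means step are assumptions made for this example; the notebook itself relies on scikit-learn's `SpectralCoclustering` rather than this code.

```python
import numpy as np


def bipartite_spectral_bipartition(A, n_iter=20):
    """Split rows and columns of a non-negative matrix A into two co-clusters.

    Sketch of the 2-way case; assumes A has no all-zero rows or columns.
    """
    A = np.asarray(A, dtype=float)
    d1 = A.sum(axis=1)  # row degrees
    d2 = A.sum(axis=0)  # column degrees
    # Degree-normalized matrix A_n = D1^{-1/2} A D2^{-1/2}
    An = A / np.sqrt(d1)[:, None] / np.sqrt(d2)[None, :]
    U, s, Vt = np.linalg.svd(An, full_matrices=False)
    # The first singular pair is trivial; the second one carries the partition.
    z = np.concatenate([U[:, 1] / np.sqrt(d1), Vt[1, :] / np.sqrt(d2)])
    # Minimal 1-D 2-means (Lloyd iterations), initialised at the extremes of z.
    centers = np.array([z.min(), z.max()])
    for _ in range(n_iter):
        labels = np.abs(z[:, None] - centers[None, :]).argmin(axis=1)
        for c in range(2):
            if np.any(labels == c):
                centers[c] = z[labels == c].mean()
    n_rows = A.shape[0]
    return labels[:n_rows], labels[n_rows:]


# Hypothetical soft-clustering matrix: 6 objects (rows), 4 soft clusters (columns).
toy = np.array([
    [0.6, 0.4, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
    [0.4, 0.5, 0.1, 0.0],
    [0.0, 0.1, 0.5, 0.4],
    [0.0, 0.0, 0.5, 0.5],
    [0.0, 0.0, 0.4, 0.6],
])
row_labels, col_labels = bipartite_spectral_bipartition(toy)
print(row_labels, col_labels)
```

The first three rows should land in one group together with the first two columns, and the remaining rows and columns in the other. Which group is labelled 0 depends on the sign of the singular vector, so only the grouping is meaningful.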

In [1]:
import glob
import os

import numpy as np
from scipy.sparse import csr_matrix
from sklearn.cluster import SpectralCoclustering
from tabulate import tabulate
from tqdm import tqdm

from clustering_evaluation import computeEvaluationMeasures, MEASURES


def getLabelsFromFile(fullPath):
    # np.int was removed in NumPy 1.24; the builtin int is equivalent here.
    v = np.loadtxt(fullPath, delimiter=',', dtype=int)
    return v


def detectPostfixesInDir(fullPath, dirName):
    fnLst = glob.glob('{0}/{1}/*labels_*.out'.format(fullPath, dirName))
    # labels_NCF_DS_
    postFixes = [f.split(os.sep)[-1].replace('labels_NCF_DS_','').replace('.out','') for f in fnLst]
    return postFixes

In [13]:
def cocl_consensus(dsDirPath, nruns=10):
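    """
    Run spectral co-clustering on every soft-clustering matrix found for one dataset
    and print a table of evaluation measures against the true labels.

    dsDirPath: dataset directory name inside the global `path` (e.g. 'BBC-seg2').
    nruns: currently unused; one result row is produced per matrix file found.
    """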
    trueLblsFname = "{0}/{1}/{2}".format(path, dsDirPath, 'labels.true')
    trueL = getLabelsFromFile(trueLblsFname)
    postFixesDS = detectPostfixesInDir(path, dsDirPath)

    dsresults = []
    header = ['file']
    header.extend(MEASURES)
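    np.random.seed(87009)
    n_clusters = np.unique(trueL).shape[0]  # number of true classes (not used below)
    # Not used below, but kept: it consumes one draw, so removing it would shift the per-run seeds.
    runSeed = np.random.randint(0, high=200_000_000)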
    
    t = tqdm(postFixesDS, leave=True)
    
    for runPostfix in t:
        softClusFN = "{0}/{1}/{2}{3}.out".format(path, dsDirPath, 'simmatrix_DS_', runPostfix)
        print("Opening soft-clustering matrix from {0}".format(softClusFN))
        # Load the dense soft-clustering matrix.
        scM = np.loadtxt(softClusFN, delimiter=',')
        # Drop all-zero columns; this does not change the dot-product similarity between rows.
        scM = csr_matrix(scM[:, ~np.all(scM == 0, axis=0)])
        print(scM.shape)

        # One co-cluster per remaining column.
        model = SpectralCoclustering(n_clusters=scM.shape[1], random_state=np.random.randint(0, high=200_000_000))
        model.fit(scM)
        pred_labels = list(model.row_labels_)

        perf = computeEvaluationMeasures(trueL, pred_labels)
        runRow = [runPostfix]
        runRow.extend([perf[m] for m in MEASURES])
        dsresults.append(runRow)
        
    #dsresults.append( ['Average'] + list(np.mean(dsresults, axis=0)[1:]) )
    print(tabulate(dsresults, headers=header, tablefmt='github', floatfmt='.6f', showindex=True))
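
The column filter used in the loop above can be checked in isolation. The sketch below uses a small made-up matrix (an assumption for illustration, not data from this notebook) to show that dropping all-zero columns leaves the row-by-row dot products unchanged. A plausible reason for the filter is that a column with zero total weight corresponds to a zero-degree vertex in the bipartite graph, for which the degree normalization is undefined.

```python
import numpy as np

# Hypothetical soft-clustering matrix: 3 objects, 4 soft clusters,
# where the second and fourth clusters have no members.
scM = np.array([
    [0.7, 0.0, 0.3, 0.0],
    [0.2, 0.0, 0.8, 0.0],
    [0.5, 0.0, 0.5, 0.0],
])

# True for every column that is zero in all rows.
empty_cols = np.all(scM == 0, axis=0)
filtered = scM[:, ~empty_cols]

# Row-by-row dot products are identical before and after the filter.
same_similarity = np.allclose(scM @ scM.T, filtered @ filtered.T)

print(filtered.shape)   # (3, 2)
print(same_similarity)  # True
```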

In [3]:
path = '../stage1_results'

In [14]:
%%time
cocl_consensus('BBC-seg2')

 20%|██        | 2/10 [00:00<00:00, 12.75it/s]

Opening soft-clustering matrix from ../stage1_results/BBC-seg2/simmatrix_DS_Jan022021.170037.out
(2012, 5)
Opening soft-clustering matrix from ../stage1_results/BBC-seg2/simmatrix_DS_Jan022021.170041.out
(2012, 5)
Opening soft-clustering matrix from ../stage1_results/BBC-seg2/simmatrix_DS_Jan022021.170045.out
(2012, 5)

 40%|████      | 4/10 [00:00<00:00, 11.94it/s]


Opening soft-clustering matrix from ../stage1_results/BBC-seg2/simmatrix_DS_Jan022021.170049.out
(2012, 5)
Opening soft-clustering matrix from ../stage1_results/BBC-seg2/simmatrix_DS_Jan022021.170053.out
(2012, 5)


 60%|██████    | 6/10 [00:00<00:00, 11.67it/s]

Opening soft-clustering matrix from ../stage1_results/BBC-seg2/simmatrix_DS_Jan022021.170058.out
(2012, 5)
Opening soft-clustering matrix from ../stage1_results/BBC-seg2/simmatrix_DS_Jan022021.170102.out
(2012, 5)
Opening soft-clustering matrix from ../stage1_results/BBC-seg2/simmatrix_DS_Jan022021.170106.out
(2012, 5)

 80%|████████  | 8/10 [00:00<00:00, 11.57it/s]


Opening soft-clustering matrix from ../stage1_results/BBC-seg2/simmatrix_DS_Jan022021.170110.out
(2012, 5)
Opening soft-clustering matrix from ../stage1_results/BBC-seg2/simmatrix_DS_Jan022021.170115.out
(2012, 5)


100%|██████████| 10/10 [00:00<00:00, 11.60it/s]

|    | file             |        E |        P |       F1 |      ACC |      NMI |     PREC |      REC |      ARI |
|----|------------------|----------|----------|----------|----------|----------|----------|----------|----------|
|  0 | Jan022021.170037 | 0.169009 | 0.940358 | 0.188427 | 0.194831 | 0.830196 | 0.182485 | 0.194831 | 0.855050 |
|  1 | Jan022021.170041 | 0.169009 | 0.940358 | 0.013685 | 0.014414 | 0.830196 | 0.013131 | 0.014414 | 0.855050 |
|  2 | Jan022021.170045 | 0.169009 | 0.940358 | 0.017935 | 0.018390 | 0.830196 | 0.017756 | 0.018390 | 0.855050 |
|  3 | Jan022021.170049 | 0.169009 | 0.940358 | 0.006426 | 0.006461 | 0.830196 | 0.006482 | 0.006461 | 0.855050 |
|  4 | Jan022021.170053 | 0.169009 | 0.940358 | 0.006228 | 0.006461 | 0.830196 | 0.006044 | 0.006461 | 0.855050 |
|  5 | Jan022021.170058 | 0.169009 | 0.940358 | 0.028249 | 0.027833 | 0.830196 | 0.028840 | 0.027833 | 0.855050 |
|  6 | Jan022021.170102 | 0.169009 | 0.940358 | 0.398286 | 0.394632 | 0.830196 | 0.40213




In [15]:
%%time
cocl_consensus('BBC-seg3')

 20%|██        | 2/10 [00:00<00:00, 17.09it/s]

Opening soft-clustering matrix from ../stage1_results/BBC-seg3/simmatrix_DS_Jan022021.171003.out
(1268, 5)
Opening soft-clustering matrix from ../stage1_results/BBC-seg3/simmatrix_DS_Jan022021.171008.out
(1268, 5)
Opening soft-clustering matrix from ../stage1_results/BBC-seg3/simmatrix_DS_Jan022021.171012.out
(1268, 5)
Opening soft-clustering matrix from ../stage1_results/BBC-seg3/simmatrix_DS_Jan022021.171016.out
(1268, 5)

 60%|██████    | 6/10 [00:00<00:00, 16.51it/s]


Opening soft-clustering matrix from ../stage1_results/BBC-seg3/simmatrix_DS_Jan022021.171020.out
(1268, 5)
Opening soft-clustering matrix from ../stage1_results/BBC-seg3/simmatrix_DS_Jan022021.171025.out
(1268, 5)
Opening soft-clustering matrix from ../stage1_results/BBC-seg3/simmatrix_DS_Jan022021.171029.out
(1268, 5)

 80%|████████  | 8/10 [00:00<00:00, 16.91it/s]


Opening soft-clustering matrix from ../stage1_results/BBC-seg3/simmatrix_DS_Jan022021.171033.out
(1268, 5)
Opening soft-clustering matrix from ../stage1_results/BBC-seg3/simmatrix_DS_Jan022021.171037.out
(1268, 5)
Opening soft-clustering matrix from ../stage1_results/BBC-seg3/simmatrix_DS_Jan022021.171041.out
(1268, 5)

100%|██████████| 10/10 [00:00<00:00, 16.42it/s]


|    | file             |        E |        P |       F1 |      ACC |      NMI |     PREC |      REC |      ARI |
|----|------------------|----------|----------|----------|----------|----------|----------|----------|----------|
|  0 | Jan022021.171003 | 0.297914 | 0.758675 | 0.151235 | 0.216877 | 0.753210 | 0.186323 | 0.216877 | 0.651198 |
|  1 | Jan022021.171008 | 0.297914 | 0.758675 | 0.415817 | 0.481073 | 0.753210 | 0.451374 | 0.481073 | 0.651198 |
|  2 | Jan022021.171012 | 0.298319 | 0.758675 | 0.013851 | 0.008675 | 0.749798 | 0.056147 | 0.008675 | 0.646488 |
|  3 | Jan022021.171016 | 0.297914 | 0.758675 | 0.030039 | 0.022871 | 0.753210 | 0.098958 | 0.022871 | 0.651198 |
|  4 | Jan022021.171020 | 0.297914 | 0.758675 | 0.030039 | 0.022871 | 0.753210 | 0.098958 | 0.022871 | 0.651198 |
|  5 | Jan022021.171025 | 0.297914 | 0.758675 | 0.030039 | 0.022871 | 0.753210 | 0.098958 | 0.022871 | 0.651198 |
|  6 | Jan022021.171029 | 0.297914 | 0.758675 | 0.146725 | 0.213722 | 0.753210 | 0.1793




In [16]:
%%time
cocl_consensus('Reuters')

  0%|          | 0/10 [00:00<?, ?it/s]

Opening soft-clustering matrix from ../stage1_results/Reuters/simmatrix_DS_Jan022021.224759.out
(18758, 6)


 10%|█         | 1/10 [00:00<00:03,  2.70it/s]

Opening soft-clustering matrix from ../stage1_results/Reuters/simmatrix_DS_Jan022021.225448.out
(18758, 6)


 20%|██        | 2/10 [00:00<00:02,  2.76it/s]

Opening soft-clustering matrix from ../stage1_results/Reuters/simmatrix_DS_Jan022021.230154.out
(18758, 6)


 30%|███       | 3/10 [00:01<00:02,  2.77it/s]

Opening soft-clustering matrix from ../stage1_results/Reuters/simmatrix_DS_Jan022021.230903.out
(18758, 6)


 40%|████      | 4/10 [00:01<00:02,  2.75it/s]

Opening soft-clustering matrix from ../stage1_results/Reuters/simmatrix_DS_Jan022021.231605.out
(18758, 6)


 50%|█████     | 5/10 [00:01<00:01,  2.79it/s]

Opening soft-clustering matrix from ../stage1_results/Reuters/simmatrix_DS_Jan022021.232310.out
(18758, 6)


 60%|██████    | 6/10 [00:02<00:01,  2.79it/s]

Opening soft-clustering matrix from ../stage1_results/Reuters/simmatrix_DS_Jan022021.233013.out
(18758, 6)


 70%|███████   | 7/10 [00:02<00:01,  2.78it/s]

Opening soft-clustering matrix from ../stage1_results/Reuters/simmatrix_DS_Jan022021.233717.out
(18758, 6)


 80%|████████  | 8/10 [00:02<00:00,  2.77it/s]

Opening soft-clustering matrix from ../stage1_results/Reuters/simmatrix_DS_Jan022021.234421.out
(18758, 6)


 90%|█████████ | 9/10 [00:03<00:00,  2.79it/s]

Opening soft-clustering matrix from ../stage1_results/Reuters/simmatrix_DS_Jan022021.235122.out
(18758, 6)


100%|██████████| 10/10 [00:03<00:00,  2.79it/s]

|    | file             |        E |        P |       F1 |      ACC |      NMI |     PREC |      REC |      ARI |
|----|------------------|----------|----------|----------|----------|----------|----------|----------|----------|
|  0 | Jan022021.224759 | 0.544396 | 0.650602 | 0.126893 | 0.157959 | 0.420185 | 0.142954 | 0.157959 | 0.349399 |
|  1 | Jan022021.225448 | 0.544396 | 0.650602 | 0.113274 | 0.143512 | 0.420185 | 0.094659 | 0.143512 | 0.349399 |
|  2 | Jan022021.230154 | 0.564636 | 0.635462 | 0.116046 | 0.112699 | 0.396329 | 0.150076 | 0.112699 | 0.341909 |
|  3 | Jan022021.230903 | 0.544396 | 0.650602 | 0.030761 | 0.033639 | 0.420185 | 0.031427 | 0.033639 | 0.349399 |
|  4 | Jan022021.231605 | 0.544396 | 0.650602 | 0.113274 | 0.143512 | 0.420185 | 0.094659 | 0.143512 | 0.349399 |
|  5 | Jan022021.232310 | 0.544396 | 0.650602 | 0.097154 | 0.085670 | 0.420185 | 0.134117 | 0.085670 | 0.349399 |
|  6 | Jan022021.233013 | 0.544396 | 0.650602 | 0.274615 | 0.290596 | 0.420185 | 0.36611


