In this assignment, you will be implementing two clustering validation measures: Normalized Mutual Information (NMI) and Jaccard similarity.

You will be given one ground-truth clustering (partition) result and five clustering test cases. You need to evaluate each test case against the ground truth using the NMI and Jaccard measures and submit the resulting scores. You will be graded on whether your scores are correct.

Each clustering result (both the ground truth and the test cases) is stored in its own file. Each line in a file consists of two integers separated by a space. The first integer is the id of a data item, and the second integer is the id of the cluster to which that item belongs.
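
The cells below read only the second column and rely on every file listing the items in the same order. If that is not guaranteed, the labels can be aligned by item id first. The following is a minimal sketch under that assumption; the helper names `parse_clustering` and `aligned_labels` and the sample lines are hypothetical and not part of the assignment.

```python
def parse_clustering(lines):
    """Map item id -> cluster id from lines of the form '<item id> <cluster id>'."""
    assignment = {}
    for line in lines:
        line = line.strip()
        if not line:                      # skip blank lines
            continue
        item_id, cluster_id = (int(tok) for tok in line.split())
        assignment[item_id] = cluster_id
    return assignment


def aligned_labels(truth, test):
    """Return two label lists ordered by item id, so position k is the same item in both."""
    if truth.keys() != test.keys():
        raise ValueError("the two clusterings do not cover the same items")
    ids = sorted(truth)
    return [truth[i] for i in ids], [test[i] for i in ids]


# Made-up sample lines; a real run would pass an open file object instead.
truth = parse_clustering(["0 2", "1 0", "2 2"])
test = parse_clustering(["2 1", "0 1", "1 0"])
print(aligned_labels(truth, test))        # ([2, 0, 2], [1, 0, 1])
```

Because an open file iterates over its lines, `parse_clustering(open('partitions.txt'))` would work the same way as the list of strings used here.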

You need to submit a file titled "scores.txt" consisting of 5 lines. Each line contains two floating-point numbers separated by a space. On the i-th line, the first number is the NMI you calculated for the i-th test case (i.e. "clustering_i.txt") with regard to the ground truth given in "partitions.txt", and the second number is the Jaccard measure you calculated for the same test case.

In [1]:
# jaccard_similarity_score is not imported: it is unused below and is no longer
# available in recent scikit-learn releases.
from sklearn.metrics import normalized_mutual_info_score

In [2]:
#https://gist.github.com/ramhiser/c990481c387058f3cce7

import itertools 

def jaccard(labels1, labels2):
    """
    Computes the Jaccard similarity between two sets of clustering labels.
    The value returned is between 0 and 1, inclusively. A value of 1 indicates
    perfect agreement between two clustering algorithms, whereas a value of 0
    indicates no agreement. For details on the Jaccard index, see:
    http://en.wikipedia.org/wiki/Jaccard_index
    Example:
    labels1 = [1, 2, 2, 3]
    labels2 = [3, 4, 4, 4]
    print(jaccard(labels1, labels2))
    @param labels1 iterable of cluster labels
    @param labels2 iterable of cluster labels
    @return the Jaccard similarity value
    """
    n11 = n10 = n01 = 0
    if len(labels1) != len(labels2):
        raise ValueError("labels1 and labels2 must have the same length")
    n = len(labels1)
    for i, j in itertools.combinations(range(n), 2):
        comembership1 = labels1[i] == labels1[j]
        comembership2 = labels2[i] == labels2[j]
        if comembership1 and comembership2:
            n11 += 1
        elif comembership1 and not comembership2:
            n10 += 1
        elif not comembership1 and comembership2:
            n01 += 1
    return float(n11) / (n11 + n10 + n01)
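
The pairwise loop above visits every one of the n(n-1)/2 item pairs. The same three counts can also be derived from cluster sizes alone, which may be faster on large inputs. The sketch below is an alternative formulation, not the gist's code; the names `pairs` and `jaccard_from_counts` are hypothetical, and it assumes Python 3 division.

```python
from collections import Counter


def pairs(k):
    """Number of unordered pairs that can be drawn from k items."""
    return k * (k - 1) // 2


def jaccard_from_counts(labels1, labels2):
    if len(labels1) != len(labels2):
        raise ValueError("labels1 and labels2 must have the same length")
    # Pairs that share a cluster in both labelings (n11).
    joint = Counter(zip(labels1, labels2))
    n11 = sum(pairs(c) for c in joint.values())
    # Pairs that share a cluster in labels1 (n11 + n10) and in labels2 (n11 + n01).
    same1 = sum(pairs(c) for c in Counter(labels1).values())
    same2 = sum(pairs(c) for c in Counter(labels2).values())
    # n11 + n10 + n01; like the loop version, this is zero when no pair is ever co-clustered.
    return n11 / (same1 + same2 - n11)


# Docstring example from above: n11 = 1, n10 = 0, n01 = 2, so the result is 1/3.
print(jaccard_from_counts([1, 2, 2, 3], [3, 4, 4, 4]))
```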

In [3]:
partitions = []

with open('partitions.txt') as f:
    content = f.readlines()

content = [x.strip() for x in content]     #to remove newline characters

for x in content:                          #to convert each line to a list
    temp = [int(y) for y in x.split(' ')]
    partitions.append(temp[1])
    
partitions[0:5]

[2, 0, 2, 1, 2]

In [4]:
scores = []

for i in range(1, 6):
    with open('clustering_' + str(i) + '.txt') as f:
        content = f.readlines()

    content = [line.strip() for line in content]     # remove newline characters

    cluster = []
    for line in content:                             # keep the cluster id of each item
        temp = [int(y) for y in line.split(' ')]
        cluster.append(temp[1])
    
    NMI = normalized_mutual_info_score(partitions, cluster, average_method='arithmetic')
    jac = jaccard(partitions, cluster)
    
    scores.append([round(NMI,7), round(jac,7)])

scores

[[0.8896248, 0.911689],
 [0.6456368, 0.6794843],
 [0.3915437, 0.4649305],
 [0.7642771, 0.8005979],
 [0.7336804, 0.5975855]]
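
For reference, the NMI values above use `average_method='arithmetic'`, which divides the mutual information by the arithmetic mean of the two entropies. The sketch below spells that formula out in plain Python as an illustration; it is not scikit-learn's implementation, the function names are hypothetical, and returning 1.0 when both labelings have zero entropy is a convention assumed here.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a labeling."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(labels1, labels2):
    """Mutual information (natural log) between two labelings of the same items."""
    n = len(labels1)
    joint = Counter(zip(labels1, labels2))
    count1, count2 = Counter(labels1), Counter(labels2)
    return sum((c / n) * math.log(c * n / (count1[a] * count2[b]))
               for (a, b), c in joint.items())


def nmi_arithmetic(labels1, labels2):
    h1, h2 = entropy(labels1), entropy(labels2)
    if h1 + h2 == 0:
        return 1.0  # assumed convention: both labelings are a single cluster
    return mutual_information(labels1, labels2) / ((h1 + h2) / 2)


# Same partition under different label names: the score should be close to 1.
print(nmi_arithmetic([0, 0, 1, 1], [1, 1, 0, 0]))
# Statistically independent labelings: the score should be close to 0.
print(nmi_arithmetic([0, 0, 1, 1], [0, 1, 0, 1]))
```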

In [5]:
with open('scores.txt', 'w') as f:
    for x in scores:
        f.write('{} {}\n'.format(x[0], x[1]))

In [6]:
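# Not used: sklearn's jaccard_similarity_score compares labels position by position
# rather than pairs of items, so it is not the clustering Jaccard measure computed above.
#sklearn_jac = jaccard_similarity_score(partitions, cluster)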