In this assignment, you will implement two clustering validation measures: Normalized Mutual Information (NMI) and Jaccard similarity.

You are given one ground-truth clustering (partition) and five clustering test cases. Evaluate each test case against the ground truth using the NMI and Jaccard measures, and submit the resulting scores. You will be graded on whether your scores are correct.

**Use base 2 for all logarithms in the NMI calculation.**
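
For reference, the two measures can be written out as below. The assignment text does not say which NMI normalization is intended; this sketch assumes the geometric-mean form, which is what the `NMI` function later in this notebook computes. The symbols are introduced here for illustration: $n$ is the number of items, $n_{ij}$ the number of items in both cluster $C_i$ and ground-truth partition $T_j$, $n_i = |C_i|$, $m_j = |T_j|$, $p_{ij} = n_{ij}/n$, $p_{C_i} = n_i/n$ and $p_{T_j} = m_j/n$. Sums run over non-empty cells only.

$$
H(\mathcal{C}) = -\sum_i p_{C_i}\log_2 p_{C_i},\qquad
I(\mathcal{C},\mathcal{T}) = \sum_i\sum_j p_{ij}\log_2\frac{p_{ij}}{p_{C_i}\,p_{T_j}},\qquad
\mathrm{NMI}(\mathcal{C},\mathcal{T}) = \frac{I(\mathcal{C},\mathcal{T})}{\sqrt{H(\mathcal{C})\,H(\mathcal{T})}}
$$

$$
TP = \sum_{i,j}\binom{n_{ij}}{2},\qquad
FN = \sum_j\binom{m_j}{2} - TP,\qquad
FP = \sum_i\binom{n_i}{2} - TP,\qquad
J = \frac{TP}{TP+FN+FP}
$$

Here $TP$ counts pairs of items that share both a cluster and a partition, $FN$ pairs that share a partition but not a cluster, and $FP$ pairs that share a cluster but not a partition.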

The ground-truth clustering (partition) results are stored in file "partitions.txt"; the five clustering result test cases are stored in file "clustering_1.txt", ..., "clustering_5.txt".

All files (partitions.txt, clustering_1.txt, ..., clustering_5.txt) can be downloaded from the `data.zip` file attached below.

Each clustering result (both ground-truth and test cases) is represented by a file. Each line in a file consists of two integers, separated by a space. The first integer represents the id of a data item, and the second integer represents the id of the cluster that this item belongs to.

You need to submit a file titled "scores.txt" consisting of 5 lines. Each line contains two floating-point numbers separated by a space. The first number on the i-th line is the NMI measure you calculated for the i-th test case (i.e., "clustering_i.txt") with regard to the ground truth given in "partitions.txt", and the second number is the Jaccard measure you calculated for the same test case.

As an example, a valid submission may look like:
```
0.1000000 0.2000000
0.3000000 0.4000000
0.5000000 0.6000000
0.7000000 0.8000000
0.9000000 1.0000000
```
You will be graded based on whether your file format is correct and on how many of the measures you submitted are correct.
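
One way to produce lines in the format shown above is sketched here. The helper name `format_scores` and the choice of seven decimal places are assumptions taken from the sample submission; the assignment text does not state a required precision.

```python
def format_scores(nmi_scores, jaccard_scores, decimals=7):
    """Return one 'NMI Jaccard' line per test case in fixed-point notation.

    Hypothetical helper: the 7-decimal default only mirrors the sample
    submission and is not a stated requirement.
    """
    template = "{:.{d}f} {:.{d}f}"
    return [template.format(nmi, jac, d=decimals)
            for nmi, jac in zip(nmi_scores, jaccard_scores)]


# Example with made-up scores for two test cases.
lines = format_scores([0.1, 0.3], [0.2, 0.4])
print("\n".join(lines))
```

Writing the result is then a matter of joining the lines with newlines and saving them to "scores.txt".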

In [8]:
from __future__ import print_function
from __future__ import division
import pandas as pd
import numpy as np
from collections import Counter
from scipy.special import comb

In [9]:
# read in data
truth = pd.read_csv('data/partitions.txt',sep=' ',names=['id','label'],index_col=['id'])
clusters = []
for i in range(1,6):
    filename = 'data/clustering_{}.txt'.format(i)
    clusters.append(pd.read_csv(filename ,sep=' ',names=['id','label'],index_col=['id']))

In [10]:
clusters

[     label
 id        
 1        2
 2        0
 3        2
 4        1
 5        2
 6        1
 7        1
 8        1
 9        2
 10       2
 11       0
 12       0
 13       1
 14       1
 15       1
 16       2
 17       1
 18       0
 19       1
 20       2
 21       1
 22       0
 23       0
 24       2
 25       1
 26       2
 27       1
 28       0
 29       2
 30       1
 ..     ...
 271      1
 272      0
 273      2
 274      1
 275      1
 276      2
 277      1
 278      2
 279      1
 280      0
 281      1
 282      0
 283      1
 284      1
 285      1
 286      0
 287      0
 288      2
 289      1
 290      0
 291      2
 292      0
 293      2
 294      2
 295      2
 296      0
 297      0
 298      2
 299      2
 300      2
 
 [300 rows x 1 columns],      label
 id        
 1        2
 2        0
 3        1
 4        1
 5        2
 6        1
 7        1
 8        2
 9        1
 10       2
 11       0
 12       2
 13       1
 14       1
 15       1
 16       2
 1

In [11]:
# definitions from Lesson 6.5 and 6.6
def Entropy(label):
    """Shannon entropy (in bits) of a label assignment."""
    count_dict = Counter(label)
    total = label.count()
    H = 0
    for value in count_dict.values():
        p = value/total
        H -= p*np.log2(p)  # base 2, as the assignment requires
    return H


def Mutual_information(cluster_label,truth_label):
    """Mutual information (in bits) between a clustering and the ground truth."""
    cluster_dict = Counter(cluster_label)
    truth_dict = Counter(truth_label)
    total = cluster_label.count()
    # joint counts: number of items with cluster label c and truth label t
    in_dict = Counter(zip(cluster_label,truth_label))
    I = 0
    for (c,t),v in in_dict.items():
        pij = v/total
        pc = cluster_dict[c]/total
        pt = truth_dict[t]/total
        I += pij*np.log2(pij/(pc*pt))
    return I


def NMI(cluster_label,truth_label):
    """NMI normalized by the geometric mean of the two entropies."""
    I = Mutual_information(cluster_label,truth_label)
    H_c = Entropy(cluster_label)
    H_t = Entropy(truth_label)
    # undefined (division by zero) if either labeling has a single group
    return I/np.sqrt(H_c*H_t)


def Jaccard_coef(cluster_label,truth_label):
    """Pair-counting Jaccard coefficient TP/(TP+FN+FP)."""
    cluster_dict = Counter(cluster_label)
    truth_dict = Counter(truth_label)
    total = cluster_label.count()
    in_dict = Counter(zip(cluster_label,truth_label))
    # TP: pairs in the same cluster and the same partition,
    # sum_ij C(n_ij,2) = 0.5*(sum_ij n_ij^2 - n)
    TP = 0
    for v in in_dict.values():
        TP += v**2
    TP = 0.5*(TP - total)
    # FP: pairs in the same cluster but in different partitions
    FP = 0
    for v in cluster_dict.values():
        FP += comb(v,2)
    FP -= TP
    # FN: pairs in the same partition but in different clusters
    FN = 0
    for v in truth_dict.values():
        FN += comb(v,2)
    FN -= TP
    Jaccard = TP/(TP+FN+FP)
    return Jaccard
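
A brute-force check can help confirm a pair-counting implementation on a small input. The sketch below is not part of the assignment code: `pair_counts`, `toy_clusters` and `toy_truth` are hypothetical names, and the function takes plain Python lists (a pandas Series would need `list(series)` first). It enumerates every pair of items, so it is only practical for small inputs.

```python
from itertools import combinations


def pair_counts(cluster_labels, truth_labels):
    """Count TP, FN, FP over all unordered pairs of items (O(n^2))."""
    tp = fn = fp = 0
    for a, b in combinations(range(len(cluster_labels)), 2):
        same_cluster = cluster_labels[a] == cluster_labels[b]
        same_truth = truth_labels[a] == truth_labels[b]
        if same_cluster and same_truth:
            tp += 1   # together in both labelings
        elif same_truth:
            fn += 1   # together in the ground truth only
        elif same_cluster:
            fp += 1   # together in the clustering only
    return tp, fn, fp


# Six items: the clustering puts one item of partition 0 into cluster 1.
toy_clusters = [0, 0, 1, 1, 1, 1]
toy_truth = [0, 0, 0, 1, 1, 1]

tp, fn, fp = pair_counts(toy_clusters, toy_truth)
jaccard = tp / (tp + fn + fp)
print(tp, fn, fp, jaccard)
```

For this toy input the counts are TP = 4, FN = 2 and FP = 3, so the Jaccard coefficient is 4/9. A contingency-table implementation should return the same value on the same input.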

In [44]:
# evaluate NMI and Jaccard
NMI_score = [NMI(cluster['label'],truth['label']) for cluster in clusters]
Jaccard_score = [Jaccard_coef(cluster['label'],truth['label']) for cluster in clusters]

# write result file
scores = pd.DataFrame({'NMI':NMI_score,'Jaccard':Jaccard_score},columns=['NMI','Jaccard'])
scores.to_csv('scores.txt',sep=' ',header=False,index=False)

NameError: name 'NMI' is not defined

In [16]:
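# check against scikit-learn - the Jaccard values from this cell are wrong!!!
# jaccard_similarity_score compares labels item by item (for multiclass labels
# it equals accuracy), so it is not the pair-counting Jaccard coefficient.
# It was deprecated in scikit-learn 0.21 and removed in 0.23.
# Note: since scikit-learn 0.22, normalized_mutual_info_score defaults to
# average_method='arithmetic'; average_method='geometric' matches the sqrt
# normalization used in NMI() above.
from sklearn.metrics import jaccard_similarity_score
from sklearn.metrics.cluster import normalized_mutual_info_score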


NMI_score_check = [normalized_mutual_info_score(cluster['label'],truth['label']) for cluster in clusters]
Jaccard_score_check = [jaccard_similarity_score(cluster['label'],truth['label']) for cluster in clusters]

# write result file
scores = pd.DataFrame({'NMI':NMI_score_check,'Jaccard':Jaccard_score_check},columns=['NMI','Jaccard'])
scores.to_csv('scores2.txt',sep=' ',header=False,index=False)
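
The mismatch noted in the cell above comes from comparing labels item by item rather than comparing pairs of items. Cluster ids are arbitrary, so a relabeled but identical clustering should still score 1. The sketch below illustrates this with plain NumPy; `elementwise_agreement` and `pairwise_jaccard` are hypothetical helper names, and the pairwise version builds n-by-n boolean matrices, so it is meant for small inputs only.

```python
import numpy as np


def elementwise_agreement(a, b):
    """Fraction of items whose labels are literally equal."""
    return float(np.mean(np.asarray(a) == np.asarray(b)))


def pairwise_jaccard(a, b):
    """Pair-counting Jaccard: co-clustered in both / co-clustered in either."""
    a = np.asarray(a)
    b = np.asarray(b)
    same_a = a[:, None] == a[None, :]   # pair (i, j) shares a label in a
    same_b = b[:, None] == b[None, :]   # pair (i, j) shares a label in b
    upper = np.triu(np.ones(same_a.shape, dtype=bool), k=1)  # each pair once
    tp = np.sum(same_a & same_b & upper)
    union = np.sum((same_a | same_b) & upper)
    return tp / union   # undefined if no pair is co-clustered in either


truth_toy = [0, 0, 0, 1, 1, 1]
relabeled = [1, 1, 1, 0, 0, 0]   # same grouping, cluster ids swapped

print(elementwise_agreement(truth_toy, relabeled))
print(pairwise_jaccard(truth_toy, relabeled))
```

The item-by-item score is 0 for this input even though the two groupings are identical, while the pair-counting score is 1.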

In [54]:
# attempt 2
# https://stats.stackexchange.com/questions/89030/rand-index-calculation

import numpy as np
from scipy.special import comb

def Jac_Coef(clusters, classes):
    """Pair-counting Jaccard; labels must be non-negative integers (np.bincount)."""
    # pairs sharing a cluster, and pairs sharing a ground-truth class
    tp_plus_fp = comb(np.bincount(clusters), 2).sum()
    tp_plus_fn = comb(np.bincount(classes), 2).sum()
    A = np.c_[(clusters, classes)]
    # within each cluster, count pairs that also share a class
    tp = sum(comb(np.bincount(A[A[:, 0] == i, 1]), 2).sum()
             for i in set(clusters))
    fp = tp_plus_fp - tp
    fn = tp_plus_fn - tp

    Jaccard = tp/(tp+fn+fp)
    return Jaccard


In [55]:
Jaccard_score_check2 = [Jac_Coef(cluster['label'],truth['label']) for cluster in clusters]

#write result file (only difference is Jaccard score)
scores = pd.DataFrame({'NMI':NMI_score_check,'Jaccard':Jaccard_score_check2},columns=['NMI','Jaccard'])
scores.to_csv('scores3.txt',sep=' ',header=False,index=False)