# Hierarchical clustering on the Zoo dataset with Passively Obtained Quadruplets

The Zoo dataset describes animals through a set of characteristics. Here, we show how to use ComparisonHC to learn a dendrogram of these animals.

## Imports
We start by importing the ComparisonHC class, the core of the method.

In [1]:
from comparisonhc import ComparisonHC

Then we choose an oracle from the oracle module. Here, we settle for a passive oracle, that is, an oracle that emulates quadruplets being obtained passively.

In [2]:
from comparisonhc.oracle import OraclePassive

The oracle uses a similarity function to generate the quadruplets. We use the cosine similarity provided by scikit-learn.

In [3]:
from sklearn.metrics.pairwise import cosine_similarity

We also need to choose the linkage used by ComparisonHC. Here, we choose the average linkage that directly uses comparisons (4-AL in the reference paper). 

In [4]:
from comparisonhc.linkage import OrdinalLinkageAverage

Finally we import numpy for array manipulations.

In [5]:
import numpy as np

## The Zoo Dataset

The Zoo dataset contains 100 animals, each described by 16 features and assigned to one of 7 groups. First, we extract the name, the features, and the group of each animal from the file.

In [6]:
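animals = []  # animal names (first column)
x = []        # feature values (middle columns)
y = []        # group labels (last column)
with open("../resources/zoo.csv", "r") as f:
    next(f, None)  # skip the header line
    for line in f:
        # Drop the trailing newline before splitting the comma-separated fields.
        split_line = line.strip().split(",")
        animals.append(split_line[0])
        x.append(split_line[1:-1])
        y.append(split_line[-1])
x = np.array(x, dtype=float)
y = np.array(y, dtype=int)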

n = x.shape[0]

## Initializing ComparisonHC

### Oracle

To initialize ComparisonHC we start by creating an oracle that exposes three methods for accessing the quadruplets: comparisons, comparisons_to_ref, and comparisons_single. Here we choose a passive oracle, which emulates quadruplets being obtained passively. In other words, when we query a quadruplet, the oracle can freely choose to answer or to abstain, and we have no way to control this behaviour. We assume that we have access to $10\%$ of the quadruplets.

In [7]:
oracle = OraclePassive(x,metric=cosine_similarity,proportion_quadruplets=0.1)
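
To make the idea of passively obtained quadruplets more concrete, here is a minimal pure-Python sketch. It is not the OraclePassive implementation. It assumes that a quadruplet compares two pairs of examples, (i, j) and (k, l), and records +1 when the first pair is more similar and -1 otherwise. The helper names `cosine` and `toy_passive_quadruplets`, the toy points, and the independent coin flip per quadruplet are all hypothetical choices made for illustration.

```python
import itertools
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def toy_passive_quadruplets(points, proportion, seed=0):
    """Return {((i, j), (k, l)): +1 or -1} for a random subset of quadruplets.

    Hypothetical helper: each quadruplet is kept independently with
    probability `proportion`, so the caller cannot choose which ones it sees.
    """
    rng = random.Random(seed)
    pairs = list(itertools.combinations(range(len(points)), 2))
    observed = {}
    for (i, j), (k, l) in itertools.combinations(pairs, 2):
        if rng.random() < proportion:
            s_ij = cosine(points[i], points[j])
            s_kl = cosine(points[k], points[l])
            # +1: (i, j) is more similar than (k, l); ties fall back to -1 in this sketch.
            observed[((i, j), (k, l))] = 1 if s_ij > s_kl else -1
    return observed


# Toy feature vectors (made-up values, not taken from the Zoo dataset).
toy_points = [[1, 0, 1], [1, 0, 0], [0, 1, 0], [0, 1, 1], [1, 1, 1]]
quadruplets = toy_passive_quadruplets(toy_points, proportion=0.1)
total = math.comb(math.comb(len(toy_points), 2), 2)
print(len(quadruplets), "quadruplets observed out of", total)
```

With a proportion of 0.1, only a small random fraction of the possible quadruplets is observed, and a clustering method using this data would have to work from that subset alone.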

### Linkage

We also need to create the linkage object that will be used to merge the clusters. This object exposes a single method, closest_clusters, which is used to choose which clusters to merge next. We use an average linkage that relies only on comparisons.

In [8]:
linkage = OrdinalLinkageAverage(oracle)

### ComparisonHC

We can now create the main ComparisonHC object using the linkage defined above.

In [9]:
chc = ComparisonHC(linkage)

## Learning the Dendrogram

To learn a dendrogram, we call the fit method of ComparisonHC with the initial clusters. Here, we start with one example per cluster.

In [10]:
chc.fit([[i] for i in range(n)])

print("ComparisonHC ran for {:.2f} seconds.".format(chc.time_elapsed))

ComparisonHC ran for 12.87 seconds.
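
For readers unfamiliar with agglomerative clustering, the sketch below illustrates the general merge loop: start from singleton clusters and repeatedly merge the two closest ones until a single cluster remains. It is not how ComparisonHC works internally. For simplicity, it swaps the comparison-based linkage for a plain average linkage on a known similarity matrix, and the names `average_similarity`, `toy_agglomerate`, and `toy_sim` are hypothetical.

```python
def average_similarity(sim, a, b):
    """Mean similarity over all cross pairs between clusters a and b."""
    return sum(sim[i][j] for i in a for j in b) / (len(a) * len(b))


def toy_agglomerate(sim, clusters):
    """Merge clusters two at a time until one remains; return the merge history."""
    clusters = [list(c) for c in clusters]
    merges = []
    while len(clusters) > 1:
        # Pick the pair of current clusters with the highest average similarity.
        p, q = max(
            ((p, q) for p in range(len(clusters)) for q in range(p + 1, len(clusters))),
            key=lambda pq: average_similarity(sim, clusters[pq[0]], clusters[pq[1]]),
        )
        merges.append((clusters[p], clusters[q]))
        merged = clusters[p] + clusters[q]
        # Remove q first (q > p) so that index p stays valid.
        del clusters[q]
        del clusters[p]
        clusters.append(merged)
    return merges


# Made-up symmetric similarity matrix for four examples.
toy_sim = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
]
history = toy_agglomerate(toy_sim, [[i] for i in range(4)])
for left, right in history:
    print(left, "+", right)
```

The list of singletons passed as the second argument plays the same role as the initial clusters given to fit: every example starts in its own cluster.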


## Evaluating the Dendrogram

To evaluate the performance of the learned dendrogram we can use Dasgupta's cost since, in this particular case, we have access to the similarity matrix of the examples.

In [11]:
cost_chc = chc.cost_dasgupta(cosine_similarity(x,x))

print("ComparisonHC learned a dendrogram with a Dasgupta's cost of {:.2f}.".format(cost_chc))

ComparisonHC learned a dendrogram with a Dasgupta's cost of 171748.48.
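
As a reminder of what this quantity measures, Dasgupta's cost sums, over all pairs of examples, the similarity of the pair multiplied by the number of leaves under their lowest common ancestor in the dendrogram. A lower cost is better, since similar examples should be merged early, in small subtrees. The sketch below is a minimal pure-Python illustration and not the cost_dasgupta implementation, whose conventions (for example, the treatment of the diagonal) may differ. The function name `toy_dasgupta_cost`, the merge-list representation, and the toy matrix are hypothetical.

```python
def toy_dasgupta_cost(sim, merges):
    """Dasgupta's cost of a binary dendrogram given as a list of (left, right) merges.

    Every pair (i, j) with i in `left` and j in `right` has the merged cluster
    as its lowest common ancestor, so it contributes
    sim[i][j] * (len(left) + len(right)) to the total.
    """
    cost = 0.0
    for left, right in merges:
        size = len(left) + len(right)
        cost += size * sum(sim[i][j] for i in left for j in right)
    return cost


# Made-up symmetric similarity matrix: examples 0 and 1 are close, as are 2 and 3.
toy_sim = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
]

# A dendrogram that merges similar examples first...
good_merges = [([0], [1]), ([2], [3]), ([0, 1], [2, 3])]
# ...and one that merges dissimilar examples first.
bad_merges = [([0], [2]), ([1], [3]), ([0, 2], [1, 3])]

good_cost = toy_dasgupta_cost(toy_sim, good_merges)
bad_cost = toy_dasgupta_cost(toy_sim, bad_merges)
print("good dendrogram: {:.2f}".format(good_cost))
print("bad dendrogram: {:.2f}".format(bad_cost))
```

On this toy matrix, the dendrogram that groups similar examples first obtains the lower cost, which matches the intuition behind the measure.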
