Supervised Hierarchical Clustering with Exponential Linkage

This repository contains code used in experiments for our ICML 2019 paper titled "Supervised Hierarchical Clustering with Exponential Linkage".

Setup

Clone* and set up the xcluster repository from https://github.com/iesl/xcluster. Make sure the xcluster repo is cloned into the same folder as this repo, i.e. the xcluster and expLinkage folders should sit in the same parent folder.

Set environment variables:

cd expLinkage
source bin/setup.sh

Data Setup

Data in n-dim vector space

The clusterFile parameter in the config files should point to a TSV file that contains the data, with each line in the following format:

<point_id> <cluster_id> <dim-1> <dim-2> .... <dim-n>
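
As an illustration of this format, the following minimal sketch parses such lines into point ids, cluster ids and vectors. It assumes that fields are separated by whitespace (tabs in a TSV file) and that ids contain no whitespace. `parse_cluster_lines` is a hypothetical helper written for this example, not a function from this repository.

```python
from typing import Iterable, List, Tuple


def parse_cluster_lines(
    lines: Iterable[str],
) -> Tuple[List[str], List[str], List[List[float]]]:
    """Parse lines of the form: <point_id> <cluster_id> <dim-1> ... <dim-n>."""
    point_ids: List[str] = []
    cluster_ids: List[str] = []
    vectors: List[List[float]] = []
    for line_no, line in enumerate(lines, start=1):
        fields = line.split()  # assumption: whitespace/tab separated fields
        if not fields:
            continue  # skip blank lines
        if len(fields) < 3:
            raise ValueError(f"line {line_no}: expected an id, a cluster id and a vector")
        point_ids.append(fields[0])
        cluster_ids.append(fields[1])
        vectors.append([float(x) for x in fields[2:]])
    # Every point should live in the same n-dim space.
    if len({len(v) for v in vectors}) > 1:
        raise ValueError("points have inconsistent dimensionality")
    return point_ids, cluster_ids, vectors


if __name__ == "__main__":
    sample = ["p0\tc0\t0.1\t0.2", "p1\tc0\t0.3\t0.4", "p2\tc1\t5.0\t6.0"]
    ids, clusters, vecs = parse_cluster_lines(sample)
    print(ids, clusters, vecs)
```

To check a real file, the helper can be passed an open file handle, since iterating over a file yields its lines.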

Data with features defined on every pair of points

The dataDir parameter in the config files should point to a data folder organized as follows:

├── NP_Coref
|   ├── doc1
|   |   ├── gtClusters.tsv
|   |   ├── pairFeatures.tsv
|   ├── doc2
|   ├── ...
|   ├── docn

All data should be in a single folder, with a separate sub-folder for each canopy or set of points. Each sub-folder contains two files: gtClusters.tsv and pairFeatures.tsv.

gtClusters.tsv contains the ground-truth cluster of each point, one point per line, in the following format: <pointId> <clusterId>

pairFeatures.tsv contains the feature vector for each pair of points, one pair per line, in the following format:
<pointId_1> <pointId_2> <feature_1> <feature_2> ... <feature_n>

The set of points in each sub-folder is clustered separately.
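
The layout described above can be read with a short sketch like the one below. It is not the loader used by this repository: `load_canopy` and `load_data_dir` are hypothetical names, and the code assumes whitespace- or tab-separated fields and ids without whitespace.

```python
import os
from typing import Dict, List, Tuple

PairFeatures = Dict[Tuple[str, str], List[float]]


def load_canopy(canopy_dir: str) -> Tuple[Dict[str, str], PairFeatures]:
    """Read gtClusters.tsv and pairFeatures.tsv from one sub-folder."""
    gt_clusters: Dict[str, str] = {}
    with open(os.path.join(canopy_dir, "gtClusters.tsv")) as f:
        for line in f:
            fields = line.split()
            if fields:  # <pointId> <clusterId>
                gt_clusters[fields[0]] = fields[1]

    pair_feats: PairFeatures = {}
    with open(os.path.join(canopy_dir, "pairFeatures.tsv")) as f:
        for line in f:
            fields = line.split()
            if fields:  # <pointId_1> <pointId_2> <feature_1> ... <feature_n>
                pair_feats[(fields[0], fields[1])] = [float(x) for x in fields[2:]]
    return gt_clusters, pair_feats


def load_data_dir(data_dir: str) -> Dict[str, Tuple[Dict[str, str], PairFeatures]]:
    """Load every sub-folder (canopy) of data_dir; each is kept separate."""
    canopies = {}
    for name in sorted(os.listdir(data_dir)):
        sub_dir = os.path.join(data_dir, name)
        if os.path.isdir(sub_dir):
            canopies[name] = load_canopy(sub_dir)
    return canopies
```

Keeping one entry per sub-folder mirrors the fact that each set of points is clustered on its own.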

Run Code

For data in n-dim vector space

cd expLinkage
python src/trainer/train_vect_data.py --config=<rel_path_to_config_file> --seed=<random_seed>

For data with features on every pair of points

cd expLinkage
python src/trainer/train_pair_feat.py --config=<rel_path_to_config_file> --seed=<random_seed>

Config files for all experiments in the paper are in the config folder.
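
When the same config is run with several random seeds, the two commands above can be scripted. The sketch below only builds and launches the command lines shown in this section. `build_commands`, `run_all` and the `TRAINERS` table are hypothetical helpers for this example, and the config path is left for the caller to supply. It is meant to be run from the expLinkage folder after sourcing bin/setup.sh.

```python
import subprocess
import sys
from typing import List, Sequence

# Trainer scripts named in this README, keyed by the kind of data they take.
TRAINERS = {
    "vect": "src/trainer/train_vect_data.py",
    "pair": "src/trainer/train_pair_feat.py",
}


def build_commands(kind: str, config_path: str, seeds: Sequence[int]) -> List[List[str]]:
    """Return one command (as an argument list) per random seed."""
    script = TRAINERS[kind]
    return [
        [sys.executable, script, f"--config={config_path}", f"--seed={seed}"]
        for seed in seeds
    ]


def run_all(commands: List[List[str]]) -> None:
    """Run the commands one after another, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

For example, `run_all(build_commands("pair", config_path, [1, 2, 3]))` would train on pairwise-feature data with three seeds, where `config_path` is the relative path to one of the files in the config folder.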

Notes

*Code from the xcluster repository is only used for evaluating dendrogram purity and is not needed for training itself (that is, when evaluation does not involve computing dendrogram purity, or when no evaluation on the dev set is performed during training).

Compatible CUDA and PyTorch versions:
- cudnn : version 7.6.0 (build: cuda10.0_0)
- pytorch : version 1.2.0 (build: cuda100py36h938c94c_0)
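
For readers unfamiliar with dendrogram purity, the metric mentioned in the first note, the sketch below shows what it measures. For every pair of points in the same ground-truth cluster, it takes the smallest subtree containing both points and records the fraction of that subtree's leaves belonging to the pair's cluster, then averages over all such pairs. This is a naive quadratic-time illustration and not the xcluster implementation. The nested-tuple tree representation and the function name are assumptions made for this example.

```python
from itertools import combinations
from typing import Dict, List, Tuple, Union

# A tree is either a leaf (a point id string) or a tuple of child trees.
Tree = Union[str, Tuple["Tree", ...]]


def dendrogram_purity(tree: Tree, labels: Dict[str, int]) -> float:
    """Average purity of the least common ancestor over same-cluster pairs."""
    total = 0.0
    num_pairs = 0

    def visit(node: Tree) -> List[str]:
        nonlocal total, num_pairs
        if not isinstance(node, tuple):
            return [node]  # leaf: a single point id
        child_leaves = [visit(child) for child in node]
        leaves = [p for group in child_leaves for p in group]
        # Count how many leaves under this node carry each label.
        counts: Dict[int, int] = {}
        for p in leaves:
            counts[labels[p]] = counts.get(labels[p], 0) + 1
        # Two points have this node as their LCA exactly when they come
        # from different children of the node.
        for left, right in combinations(child_leaves, 2):
            for p in left:
                for q in right:
                    if labels[p] == labels[q]:
                        total += counts[labels[p]] / len(leaves)
                        num_pairs += 1
        return leaves

    visit(tree)
    # With no same-cluster pairs the metric is undefined; this sketch returns 1.0.
    return total / num_pairs if num_pairs else 1.0
```

A tree that merges each ground-truth cluster before mixing clusters scores 1.0, while a tree that first joins points from different clusters scores lower.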
