GitHub - hankcs/bayon: a simple and fast clustering tool

Overview

Bayon is a simple and fast hard-clustering tool.

Bayon supports Repeated Bisection clustering and K-means clustering.

Install

% ./configure
% make
% sudo make install

Usage

Clustering input data

% bayon -n num [options] file
% bayon -l limit [options] file
   -n, --number=num      the number of clusters
   -l, --limit=lim       limit value of cluster bisection
   -p, --point           output similarity points
   -c, --clvector=file   save the vectors of cluster centroids
   --clvector-size=num   max size of output vectors of
                         cluster centroids (default: 50)
   --method=method       clustering method(rb, kmeans), default:rb
   --seed=seed           set a seed for random number generator

Get similar clusters for each input document

% bayon -C file [options] file
   -C, --classify=file   target vectors
   --inv-keys=num        max size of the keys of each vector to be
                         looked up in inverted index (default: 20)
   --inv-size=num        max size of the inverted index of each key
                         (default: 100)
   --classify-size=num   max size of output similar groups
                         (default: 20)

Common options

   --vector-size=num     max size of each input vector
   --idf                 apply idf to input vectors
   -h, --help            show help messages
   -v, --version         show the version and exit
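
As a rough illustration of what the --idf option refers to, here is a minimal Python sketch of inverse document frequency weighting. The exact formula bayon uses is not stated in this README; the textbook form log(N / df) used below, and the function name apply_idf, are assumptions made only for illustration.

```python
import math
from collections import Counter

def apply_idf(docs):
    """Rescale each value by log(N / df), a common idf definition.

    docs maps document_id -> {key: value}.
    NOTE: this is an assumed formula, not necessarily bayon's own.
    """
    n_docs = len(docs)
    # df[key] = number of documents whose vector contains that key
    df = Counter(key for features in docs.values() for key in features)
    return {
        doc_id: {key: value * math.log(n_docs / df[key])
                 for key, value in features.items()}
        for doc_id, features in docs.items()
    }

if __name__ == "__main__":
    docs = {"d1": {"a": 1.0, "b": 2.0}, "d2": {"a": 3.0}}
    print(apply_idf(docs))
```

Under this definition a key that occurs in every document gets weight 0, so it no longer contributes to the similarity between vectors.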

Example

clustering (number_of_output_clusters = 100)

% bayon -n 100 input.tsv > cluster.tsv

clustering (save vectors of cluster centroids)

% bayon -n 100 -c centroid.tsv input.tsv > cluster.tsv

classification (get similar clusters for input documents)

% bayon -C centroid.tsv input.tsv > classify.tsv
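
The example invocations above can also be assembled from a script. The following Python sketch builds the argument lists using only the flags listed under Usage. The function names are hypothetical, and the check that exactly one of -n or -l is given is an assumption based on the two usage lines, not documented behaviour.

```python
import shlex

def bayon_cluster_cmd(input_path, num_clusters=None, limit=None,
                      centroid_path=None, points=False):
    """Build the argument list for a clustering run."""
    # Assumption: a run uses either -n or -l, as in the two usage lines.
    if (num_clusters is None) == (limit is None):
        raise ValueError("give exactly one of num_clusters (-n) or limit (-l)")
    cmd = ["bayon"]
    if num_clusters is not None:
        cmd += ["-n", str(num_clusters)]
    else:
        cmd += ["-l", str(limit)]
    if points:
        cmd.append("-p")                 # output similarity points
    if centroid_path is not None:
        cmd += ["-c", centroid_path]     # save cluster centroid vectors
    cmd.append(input_path)
    return cmd

def bayon_classify_cmd(centroid_path, input_path):
    """Build the argument list for a classification run."""
    return ["bayon", "-C", centroid_path, input_path]

if __name__ == "__main__":
    # Print the commands; to execute one, pass the list to subprocess.run
    # with stdout redirected to the desired output file.
    print(shlex.join(bayon_cluster_cmd("input.tsv", num_clusters=100,
                                       centroid_path="centroid.tsv")))
    print(shlex.join(bayon_classify_cmd("centroid.tsv", "input.tsv")))
```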

Format of Input Data

List of the vectors of input documents for clustering and classification

document_id1 \t key1-1 \t value1-1 \t key1-2 \t value1-2 \t ...\n
document_id2 \t key2-1 \t value2-1 \t key2-2 \t value2-2 \t ...\n
...

document_id : string
key : string
value : double
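
A minimal Python sketch of producing a file in this format from in-memory data. The helper names and the sample documents are made up for illustration. The sketch rejects ids and keys that contain a tab or newline, since those would break the tab-separated layout.

```python
import sys

def format_vector_line(doc_id, features):
    """Return one input line: document_id, then alternating key and value."""
    fields = [str(doc_id)]
    for key, value in features.items():
        fields.append(str(key))
        fields.append(repr(float(value)))   # value is written as a double
    for field in fields:
        if "\t" in field or "\n" in field:
            raise ValueError("tabs and newlines are not allowed in fields")
    return "\t".join(fields) + "\n"

def write_vectors(docs, out):
    """Write every document vector in docs (id -> {key: value}) to out."""
    for doc_id, features in docs.items():
        out.write(format_vector_line(doc_id, features))

if __name__ == "__main__":
    sample = {
        "doc1": {"apple": 2, "pear": 0.5},
        "doc2": {"apple": 1, "plum": 3},
    }
    write_vectors(sample, sys.stdout)
```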

List of the vectors of cluster centroids

cluster_id1 \t key1-1 \t value1-1 \t key1-2 \t value1-2 \t ...\n
cluster_id2 \t key2-1 \t value2-1 \t key2-2 \t value2-2 \t ...\n
...

cluster_id : string
key : string
value : double

Format of Output Data

List of clusters (output of clustering)

cluster_id1 \t document_id1 \t document_id2 \t document_id3 \t ...\n
cluster_id2 \t document_id4 \t document_id5 \t document_id6 \t ...\n
...

cluster_id : integer (>= 1)
document_id : string
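
A minimal Python sketch of reading this output back into a script. The function names are hypothetical. Because bayon is a hard-clustering tool, the sketch assumes each document_id appears in exactly one cluster when it builds the reverse mapping.

```python
def parse_clusters(lines):
    """Map cluster_id (int) -> list of document_id strings."""
    clusters = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue                      # skip blank lines
        cluster_id, *doc_ids = line.split("\t")
        clusters[int(cluster_id)] = doc_ids
    return clusters

def document_to_cluster(clusters):
    """Invert the mapping: document_id -> cluster_id."""
    return {doc_id: cluster_id
            for cluster_id, doc_ids in clusters.items()
            for doc_id in doc_ids}

if __name__ == "__main__":
    sample = ["1\tdocA\tdocB\n", "2\tdocC\n"]
    clusters = parse_clusters(sample)
    print(clusters)
    print(document_to_cluster(clusters))
```

To read a real result, pass an open file instead of the sample list, for example `parse_clusters(open("cluster.tsv"))`.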

List of clusters with similarity values between documents and clusters (when clustering with the --point option)

cluster_id1 \t document_id1 \t point1 \t document_id2 \t point2 \t ...\n
cluster_id2 \t document_id3 \t point3 \t document_id4 \t point4 \t ...\n
...

cluster_id : integer (>= 1)
document_id : string
point : double
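
With --point, each document_id is followed by its similarity value, so the fields after cluster_id come in pairs. A minimal Python sketch of parsing that layout (the function name is hypothetical):

```python
def parse_clusters_with_points(lines):
    """Map cluster_id (int) -> list of (document_id, point) tuples."""
    clusters = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields == [""]:
            continue                      # skip blank lines
        cluster_id, rest = int(fields[0]), fields[1:]
        if len(rest) % 2 != 0:
            raise ValueError("expected document_id/point pairs")
        clusters[cluster_id] = [
            (rest[i], float(rest[i + 1])) for i in range(0, len(rest), 2)
        ]
    return clusters

if __name__ == "__main__":
    sample = ["1\tdocA\t0.9\tdocB\t0.5\n", "2\tdocC\t0.75\n"]
    print(parse_clusters_with_points(sample))
```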

List of the vectors of cluster centroids (when clustering with the --clvector option)

cluster_id1 \t key1-1 \t value1-1 \t key1-2 \t value1-2 \t ...\n
cluster_id2 \t key2-1 \t value2-1 \t key2-2 \t value2-2 \t ...\n
...

cluster_id : integer (>= 1)
key : string
value : double

List of similar clusters for each input document

document_id1 \t cluster_id1 \t point1 \t cluster_id2 \t point2 \t ...\n
document_id2 \t cluster_id3 \t point3 \t cluster_id4 \t point4 \t ...\n
...

document_id : string
cluster_id : string
point : double
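
A minimal Python sketch of reading the classification output and picking one cluster per document. The function names are hypothetical. The sketch assumes that a larger point means a more similar cluster, and it does not rely on the pairs being sorted within a line.

```python
def parse_classification(lines):
    """Map document_id -> list of (cluster_id, point) tuples."""
    result = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields == [""]:
            continue                      # skip blank lines
        doc_id, rest = fields[0], fields[1:]
        if len(rest) % 2 != 0:
            raise ValueError("expected cluster_id/point pairs")
        result[doc_id] = [
            (rest[i], float(rest[i + 1])) for i in range(0, len(rest), 2)
        ]
    return result

def best_cluster(pairs):
    """Return the cluster_id with the highest point, or None if empty."""
    if not pairs:
        return None
    return max(pairs, key=lambda pair: pair[1])[0]

if __name__ == "__main__":
    sample = ["docA\t3\t0.4\t7\t0.8\n", "docB\t2\t0.6\n"]
    for doc_id, pairs in parse_classification(sample).items():
        print(doc_id, best_cluster(pairs))
```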

Requirement

C++ compiler with STL (Standard Template Library)

License

GPLv2 (GNU General Public License, version 2)

Author

Mizuki Fujisawa <fujisawa@bayon.cc>
