Principal Direction Divisive Partitioning (PDDP)

Description

PDDP is a clustering algorithm. It takes as input a set of vectors arranged as an n x m matrix M (usually m >> n), where each column is a vector, and computes the 'principal component' of the data, an eigenvector, as output.

The first step of the PDDP algorithm is to calculate the mean vector w of the m column vectors d_1, d_2, ..., d_m of M:

w = (1/m) * (d_1 + d_2 + ... + d_m)
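
For illustration, here is a minimal serial sketch of this step, assuming M is stored column-major as a flat array with column j holding the vector d_j (the function name and storage layout are assumptions for the example, not taken from this repository):

```c
/* Sketch: mean of the m column vectors of an n x m matrix M,
 * stored column-major so element (i, j) is M[j * n + i].
 * Hypothetical helper, not code from this repository. */
void mean_vector(const double *M, double *w, int n, int m)
{
    for (int i = 0; i < n; ++i)
        w[i] = 0.0;
    for (int j = 0; j < m; ++j)          /* sum the column vectors d_j */
        for (int i = 0; i < n; ++i)
            w[i] += M[j * n + i];
    for (int i = 0; i < n; ++i)          /* divide by the number of vectors */
        w[i] /= (double)m;
}
```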

The next step is to calculate the matrix C:

C = (M − w * e^T)^T * (M − w * e^T) = A^T * A,
where A = M − w * e^T and e = [1 1 1 ... 1]^T

In a nutshell, the mean vector w is subtracted from every column of the input matrix M to form the centered matrix A, and A is then multiplied on the left by its own transpose to form C = A^T * A.
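
A serial sketch of this step, under the same assumptions as above (column-major M and A, row-major m x m matrix C; the helper names are hypothetical):

```c
/* Sketch: A = M - w * e^T (subtract the mean from every column),
 * then C = A^T * A, an m x m matrix stored row-major.
 * Hypothetical helpers, not code from this repository. */
void center_columns(const double *M, const double *w, double *A, int n, int m)
{
    for (int j = 0; j < m; ++j)
        for (int i = 0; i < n; ++i)
            A[j * n + i] = M[j * n + i] - w[i];
}

void gram_matrix(const double *A, double *C, int n, int m)
{
    for (int i = 0; i < m; ++i)          /* C[i][j] = column_i(A) . column_j(A) */
        for (int j = 0; j < m; ++j) {
            double dot = 0.0;
            for (int k = 0; k < n; ++k)
                dot += A[i * n + k] * A[j * n + k];
            C[i * m + j] = dot;
        }
}
```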

After that, an approximation of the principal component (the dominant eigenvector of C) is calculated iteratively. Starting with a vector x_0 = [1 1 1 ... 1]^T, the final result is approached as follows:

x_{k+1} = (C * x_k) / || C * x_k ||

The algorithm converges when the new vector x_{k+1} differs from the previous x_k by only a small amount. The resulting eigenvector is used to separate the input data into two subsets A and B as follows. The eigenvector has positive and negative entries, and each entry corresponds to one vector of the input matrix: the first entry to the first vector, the second to the second, and so on. Subset A is made up of the vectors matched with positive entries, and subset B of the vectors matched with negative entries.
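
This iteration is the classic power method; since C = A^T * A is symmetric positive semidefinite, it converges to the dominant eigenvector. A minimal serial sketch, with the same storage assumptions and hypothetical names as above:

```c
#include <math.h>
#include <stdlib.h>

/* Sketch: power iteration on the m x m matrix C (row-major),
 * starting from x_0 = [1 1 ... 1]^T and stopping when successive
 * iterates differ by less than tol.
 * Hypothetical helper, not code from this repository. */
void power_iteration(const double *C, double *x, int m, double tol)
{
    double *y = (double *)malloc(m * sizeof *y);
    for (int i = 0; i < m; ++i)
        x[i] = 1.0;                           /* x_0 = [1 1 ... 1]^T */

    double diff;
    do {
        double norm = 0.0;
        for (int i = 0; i < m; ++i) {         /* y = C * x_k */
            double s = 0.0;
            for (int j = 0; j < m; ++j)
                s += C[i * m + j] * x[j];
            y[i] = s;
            norm += s * s;
        }
        norm = sqrt(norm);

        diff = 0.0;
        for (int i = 0; i < m; ++i) {         /* x_{k+1} = y / ||y|| */
            double next = y[i] / norm;
            diff += fabs(next - x[i]);
            x[i] = next;
        }
    } while (diff > tol);

    free(y);
}
```

Once x has converged, column j of the input can be assigned to subset A if x[j] > 0 and to subset B otherwise.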

Further clustering can be achieved by giving one of these subsets as input to the algorithm.

More information can be found here

Implementations

Three implementations of the PDDP algorithm are provided in total.

| Version | Directory | Description |
| ------- | --------- | ----------- |
| Serial | serial/ | A serial implementation in C. |
| Naive CUDA implementation | naive/ | A naive implementation in CUDA that uses global memory exclusively. |
| Optimized CUDA implementation | optimized/ | An optimized implementation in CUDA that uses shared memory. |
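
The dominant cost of the iteration is the matrix-vector product C * x_k, which is a natural place to contrast the two CUDA versions. Below is a generic sketch of a naive global-memory kernel next to a tiled variant that stages x in shared memory; these kernels only illustrate the technique and are not the kernels used in this repository.

```cuda
// Naive: every thread reads the whole vector x from global memory.
__global__ void matvec_global(const double *C, const double *x, double *y, int m)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= m) return;
    double sum = 0.0;
    for (int j = 0; j < m; ++j)
        sum += C[row * m + j] * x[j];
    y[row] = sum;
}

// Optimized: each block cooperatively loads a tile of x into shared
// memory, so each element of x is read from global memory once per block.
__global__ void matvec_shared(const double *C, const double *x, double *y, int m)
{
    extern __shared__ double xs[];               // blockDim.x doubles
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    double sum = 0.0;

    for (int tile = 0; tile < m; tile += blockDim.x) {
        int col = tile + threadIdx.x;
        xs[threadIdx.x] = (col < m) ? x[col] : 0.0;
        __syncthreads();

        int limit = min((int)blockDim.x, m - tile);
        if (row < m)
            for (int j = 0; j < limit; ++j)
                sum += C[row * m + (tile + j)] * xs[j];
        __syncthreads();
    }
    if (row < m)
        y[row] = sum;
}
```

The shared-memory kernel would be launched with the tile size passed as dynamic shared memory, for example matvec_shared<<<(m + 255) / 256, 256, 256 * sizeof(double)>>>(d_C, d_x, d_y, m).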

Compile Instructions

Change directory to the desired implementation, and run:

make

The executable will be placed in the 'bin/' directory and the object files in the 'build/' directory.

Run

All the implementations produce an executable called 'pddp' in the 'bin/' directory. It is executed as follows:

./pddp input_file [output_file]

  • 'input_file': a file in .csv format in which each column represents a vector
  • 'output_file': the file where the final eigenvector is written in .csv format; the default filename is 'output.csv'
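
For example, with a hypothetical dataset file:

./pddp datasets/data1.csv result.csv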

Tests

Test datasets are not uploaded at the moment due to their large size. The test.py script can be used by providing your own datasets and reference results.

Examples

  • Execute and test the 'serial' implementation with the 1st dataset

./test.py -t1 serial

  • Execute and test the 'shared' implementation with all the datasets and clean up afterwards (removing object files and results).

./test.py --all --clean shared

  • Execute and test the 'shared' implementation with the 3rd dataset, comparing the results up to the 6th decimal place.

./test.py -d6 -t3 shared
