pca_transform: Java PCA transformation of a data matrix

Introduction

This is a Java library that implements Principal Component Analysis (PCA) data transformation. It operates on data matrices where each row corresponds to a single real vector and each column corresponds to a single dimension.

Downloads

The library can be downloaded as a binary or source package from https://github.com/mkobos/pca_transform/downloads.

The source can also be retrieved directly from the Git repository: git clone git@github.com:mkobos/pca_transform.git.

Features

The main use case of the library is to create a PCA transformation based on training data and then to apply it to test data. The library can perform two transformations on the test data:

  • Rotation -- the data matrix is transformed to get a diagonal covariance matrix. This transformation is sometimes simply called PCA.
  • Whitening -- the data matrix is transformed to get the unit covariance matrix (a short usage sketch follows this list).
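
For instance, assuming trainingData and testData matrices like those in the Example usage section at the end of this README, both transformation types can be requested from the same PCA object. The following is only a sketch; the WHITENING constant appears in the example below, while the ROTATION constant name is an assumption based on the list above:

// Fragment: trainingData and testData are Jama matrices, as in the
// "Example usage" section below.
PCA pca = new PCA(trainingData);

// Rotation: the transformed data has a diagonal covariance matrix.
// (The ROTATION constant name is assumed here.)
Matrix rotated = pca.transform(testData, PCA.TransformationType.ROTATION);

// Whitening: the transformed data has the unit covariance matrix.
Matrix whitened = pca.transform(testData, PCA.TransformationType.WHITENING);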

The method deals with situations where some of the columns in the data matrix are linearly dependent or where there are more columns than rows in the data matrix, i.e. there are more dimensions than samples in the data set. In such cases, the method discards some of the output space dimensions; in other words, a transformation is generated where the output vectors have fewer dimensions than the input ones. To handle situations where the columns are almost linearly dependent, we use a threshold parameter to discard dimensions with almost zero standard deviation; it is set to a default value (tol = sqrt(.Machine$double.eps)) introduced in the help page of the prcomp function from the R statistical environment (see the part of the help page where the tol parameter is described). .Machine$double.eps is the floating-point machine precision of the R statistical environment (normally equal to 2.220446e-16).
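
As an illustration only, the sketch below applies such a cut-off in Java, following the prcomp convention that a component is dropped when its standard deviation does not exceed tol times the standard deviation of the first component; the class and method names are made up for this sketch and are not part of the library's API:

// Illustrative sketch (not the library's API): prcomp-style tolerance rule.
class ToleranceSketch {
    // Counts the components kept when those with almost zero standard
    // deviation are discarded; componentSd is assumed to be sorted in
    // decreasing order, as a PCA/SVD routine would return it.
    static int countRetainedComponents(double[] componentSd) {
        // Machine epsilon for double is about 2.220446e-16, so tol is about 1.49e-8.
        double tol = Math.sqrt(Math.ulp(1.0));
        int retained = 0;
        for (double sd : componentSd) {
            if (sd > tol * componentSd[0]) {
                retained++;
            }
        }
        return retained;
    }
}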

The library can also be used to check whether a given test point lies in the PCA-generated subspace. If it does not, it does not belong to the transformation domain and can be treated as an outlier (see Fig.1 for an illustration). Here, we use the following practical rule to decide whether a vector belongs to the domain: if the vector belongs to the domain space, its non-significant components (those with standard deviation equal to 0) should be zeroed by the matrix transformation; if they are not, we assume that the vector is an outlier. We use a somewhat ad-hoc threshold to check whether the values are different from zero (if the absolute value is smaller than the threshold, it is assumed to be zero). The threshold is equal to 3 * sqrt(.Machine$double.eps) * sd(PC1), where sqrt(.Machine$double.eps) is the value mentioned in the previous paragraph and sd(PC1) is the standard deviation of the first principal component.
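
A minimal sketch of this rule is given below. It assumes that the test point has already been expressed in the full principal-component basis (including the components that the transformation would normally discard) and that the per-component standard deviations of the training data are available; the class, method, and parameter names are made up for this sketch and are not part of the library's API:

// Illustrative sketch (not the library's API): the outlier-detection rule.
class OutlierRuleSketch {
    // pointInPcaBasis: a test point expressed in the full PCA basis.
    // componentSd: per-component standard deviations of the training data,
    // sorted in decreasing order (componentSd[0] corresponds to PC1).
    static boolean isOutlier(double[] pointInPcaBasis, double[] componentSd) {
        double tol = Math.sqrt(Math.ulp(1.0));       // sqrt(.Machine$double.eps)
        double threshold = 3 * tol * componentSd[0]; // 3 * sqrt(eps) * sd(PC1)
        for (int i = 0; i < pointInPcaBasis.length; i++) {
            // A component whose training standard deviation is (numerically)
            // zero should also be zero for any point belonging to the subspace.
            boolean nonSignificant = componentSd[i] <= tol * componentSd[0];
            if (nonSignificant && Math.abs(pointInPcaBasis[i]) > threshold) {
                return true;
            }
        }
        return false;
    }
}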

Fig.1: An example of dimensionality reduction and outlier detection in PCA. On the left we have the situation before and on the right after the PCA transformation. The black dots correspond to two-dimensional samples from the training data set (situated on a single line) which are used to set the transformation parameters. The red point is an outlier point from the testing set, and the green point is a non-outlier point from the testing set. It can be seen that the two-dimensional samples from the training set are reduced to one-dimensional samples by the rotation transformation -- they all lie on the X axis now (their Y coordinate equals 0). The same goes for the green non-outlier point from the testing set. The red outlier point, on the other hand, does not have its Y coordinate equal to 0.

Technical details

Two methods of obtaining the transformation matrix are implemented: the first is based on the SVD of the data matrix and the second on the eigenvalue decomposition of its covariance matrix. The SVD-based method is more numerically stable and is used by default.
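
The sketch below, written directly against the bundled Jama library, shows what the two approaches look like on a centered data matrix; it is an independent illustration of the idea, not the code used inside the library:

import Jama.EigenvalueDecomposition;
import Jama.Matrix;
import Jama.SingularValueDecomposition;

// Illustration with plain Jama: two ways of obtaining the principal axes of a
// centered data matrix X (rows = samples, columns = dimensions).
class DecompositionSketch {
    // SVD route: X = U * S * V^T; the columns of V are the principal axes.
    static Matrix principalAxesViaSvd(Matrix centeredX) {
        SingularValueDecomposition svd = centeredX.svd();
        return svd.getV();
    }

    // Eigenvalue route: the eigenvectors of the covariance matrix
    // C = X^T * X / (n - 1) span the same directions as the columns of V above
    // (possibly in a different order and with flipped signs).
    static Matrix principalAxesViaEig(Matrix centeredX) {
        int n = centeredX.getRowDimension();
        Matrix covariance = centeredX.transpose().times(centeredX).times(1.0 / (n - 1));
        EigenvalueDecomposition eig = covariance.eig();
        return eig.getV();
    }
}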

Requirements

  • The library uses the Jama matrix library (included in the binary package) for linear algebra operations.
  • If you want to compile the library from source, you need to have the Apache Ant and Apache Ivy tools installed on your system.

Reliability

The default SVD-based version of the algorithm has been checked to return the same results as the implementation of PCA from the R statistical environment -- the prcomp function (see the unit tests in the source code). Some of the columns of the transformed testing data matrix might be multiplied by -1 when compared with the data returned by R's prcomp. This is nothing to worry about since both versions of the matrix are correct outputs of the PCA transformation and both are essentially equivalent as far as data processing applications are concerned.
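
For example, a column-wise comparison that tolerates such sign flips could look like the following sketch (purely illustrative, not part of the library or its tests):

import Jama.Matrix;

// Illustrative sketch: checks that two matrices are equal column by column up
// to a possible sign flip of whole columns, as may happen when comparing this
// library's output with the output of R's prcomp.
class SignInvariantComparison {
    static boolean equalUpToColumnSigns(Matrix a, Matrix b, double eps) {
        if (a.getRowDimension() != b.getRowDimension()
                || a.getColumnDimension() != b.getColumnDimension()) {
            return false;
        }
        for (int c = 0; c < a.getColumnDimension(); c++) {
            boolean same = true;    // column c of a equals column c of b
            boolean flipped = true; // column c of a equals -1 * column c of b
            for (int r = 0; r < a.getRowDimension(); r++) {
                if (Math.abs(a.get(r, c) - b.get(r, c)) > eps) same = false;
                if (Math.abs(a.get(r, c) + b.get(r, c)) > eps) flipped = false;
            }
            if (!same && !flipped) {
                return false;
            }
        }
        return true;
    }
}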

License and conditions of use

The library is available under the permissive MIT license (see the file MIT-LICENSE.txt in the distribution package for the text of the license).

Example usage

Here is an example usage from one of the library's source files (src/pca_transform/SampleRun.java):

package pca_transform;

import Jama.Matrix;

public class SampleRun {
    public static void main(String[] args){
        System.out.println("Running a demonstrational program on some sample data ...");
        // Training data: each row is a sample, each column a dimension.
        Matrix trainingData = new Matrix(new double[][] {
            {1, 2, 3, 4, 5, 6},
            {6, 5, 4, 3, 2, 1},
            {2, 2, 2, 2, 2, 2}});
        // Set up the PCA transformation based on the training data.
        PCA pca = new PCA(trainingData);
        Matrix testData = new Matrix(new double[][] {
                {1, 2, 3, 4, 5, 6},
                {1, 2, 1, 2, 1, 2}});
        // Apply the whitening transformation to the test data.
        Matrix transformedData =
            pca.transform(testData, PCA.TransformationType.WHITENING);
        System.out.println("Transformed data:");
        for(int r = 0; r < transformedData.getRowDimension(); r++){
            for(int c = 0; c < transformedData.getColumnDimension(); c++){
                System.out.print(transformedData.get(r, c));
                if (c == transformedData.getColumnDimension()-1) continue;
                System.out.print(", ");
            }
            System.out.println("");
        }
    }
}

This results in the following output:

Running a demonstrational program on some sample data ...
Transformed data:
-0.9999999999999998, -0.5773502691896267
-0.08571428571428596, 1.7320508075688776

This program can be run by executing the library's JAR file from the binary distribution package.
