NaivePCA performs principal component analysis (PCA) on population genotype data. It implements the basic algorithm described by Patterson et al. (2006). More precisely, suppose we have m samples of ploidy h and n biallelic markers. Let G_{ij} be the number of non-reference alleles for sample i at marker j. NaivePCA computes:
\mu_j = \sum_{i=1}^m G_{ij} / m
p_j = \mu_j / h
M_{ij} = \frac{G_{ij}-\mu_j}{\sqrt{p_j(1-p_j)}}
X_{ij} = \sum_{k=1}^n M_{ik} M_{jk} / n
and finds the eigenvectors of the m-by-m matrix X = (X_{ij}), in which both indices run over samples. Notably, if G_{ij} is missing, M_{ij} is set to zero, and the computation of \mu_j must be adjusted accordingly.
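The computation above can be sketched in Python with NumPy. This is an illustration only, not NaivePCA's actual source: the function name `naive_pca`, the use of NaN to mark missing genotypes, the choice to average \mu_j over non-missing samples only, and the zeroing of monomorphic markers (where p_j(1-p_j) = 0) are all assumptions made for the example.

```python
import numpy as np

def naive_pca(G, ploidy=2):
    """Hypothetical sketch. G is an (m, n) array of non-reference allele
    counts, with np.nan marking missing genotypes."""
    G = np.asarray(G, dtype=float)
    m, n = G.shape
    observed = ~np.isnan(G)

    # mu_j: mean allele count at marker j. Assumption: missing genotypes
    # are excluded from both the sum and the sample count.
    n_obs = observed.sum(axis=0)
    sums = np.where(observed, G, 0.0).sum(axis=0)
    mu = np.divide(sums, n_obs, out=np.zeros(n), where=n_obs > 0)

    # p_j = mu_j / h
    p = mu / ploidy
    var = p * (1.0 - p)

    # M_ij = (G_ij - mu_j) / sqrt(p_j (1 - p_j)), with missing entries set to 0.
    # Assumption: monomorphic markers (var == 0) contribute nothing.
    scale = np.zeros(n)
    informative = var > 0
    scale[informative] = 1.0 / np.sqrt(var[informative])
    M = np.where(observed, G - mu, 0.0) * scale

    # X_ij = sum_k M_ik M_jk / n, an m-by-m symmetric matrix over samples.
    X = M @ M.T / n

    # eigh returns eigenvalues in ascending order; reorder to descending.
    eigvals, eigvecs = np.linalg.eigh(X)
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], eigvecs[:, order]
```

Column k of the returned eigenvector matrix then corresponds to the (k+1)-th largest eigenvalue. Because each column of M sums to zero under these assumptions, the smallest eigenvalue is zero up to rounding error.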
The input of NaivePCA looks like:
sample1 110022110100202021001122*201
sample2 2012201102*221020211*1222001
where each digit represents a genotype and any other character is treated as missing data. For now, NaivePCA does not support real-valued matrices. The output is TAB-delimited. The first column is the sample name. For i >= 2, the i-th column gives the eigenvector corresponding to the (i-1)-th largest eigenvalue.
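To make the input and output layout concrete, here is a minimal Python sketch of reading and writing these formats. The helper names `parse_genotypes` and `format_output` are hypothetical, and the handling of blank or malformed lines (skipped) and the number formatting are assumptions, not documented NaivePCA behavior.

```python
def parse_genotypes(lines):
    """Hypothetical reader: returns sample names and per-sample genotype
    lists, with None standing in for missing data."""
    names, rows = [], []
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # assumption: skip blank or malformed lines
        name, geno = fields
        names.append(name)
        # A digit is a genotype; any other character is missing data.
        rows.append([int(c) if c in "0123456789" else None for c in geno])
    return names, rows


def format_output(names, components):
    """Hypothetical writer: one TAB-delimited row per sample, the name
    first, then that sample's entry in each eigenvector."""
    return "\n".join(
        name + "\t" + "\t".join(f"{v:.6g}" for v in row)
        for name, row in zip(names, components)
    )
```

For instance, `parse_genotypes(["sample1 10*2"])` yields the name `sample1` and the genotype list `[1, 0, None, 2]`.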