IPCAPS
is an unsupervised clustering algorithm based on iterative
pruning to capture population structure. This version supports ordinal
data which can be applied directly to SNP data to identify fine-level
population structure and it is built on the iterative pruning Principal
Component Analysis (ipPCA) algorithm by Intarapanich et al. (2009)
<doi: 10.1186/1471-2105-10-382> and Limpiti et al. (2011)<doi:
10.1186/1471-2105-12-255>. The IPCAPS involves an iterative process
using multiple splits based on multivariate Gaussian mixture modeling of
principal components and Clustering EM estimation as in Lebret et
al. (2015) <doi: 10.18637/jss.v067.i06>. In each iteration, rough
clusters and outliers are also identified using the function
rubikclust()
from the R package KRIS
.
You can install the released version of IPCAPS from CRAN with:
install.packages("IPCAPS")
Alternatively, you can install the dev version of IPCAPS from Github with
install.packages("remotes")
remotes::install_github("kridsadakorn/ipcaps", dependencies = TRUE)
You can see the reference manual from: https://www.biostatgen.org/ipcaps/
Fine-scale subpopulation detection via an SNP-based unsupervised method: A case study on the 1000 Genomes Project resources
Kridsadakorn Chaichoompu, Alisa Wilantho, Pongsakorn Wangkumhang, Sissades Tongsima, Bruno Cavadas, Luísa Pereira, and Kristel Van Steen
You can see the extra supplementary information from: https://www.biostatgen.org/paper_ipcaps/
This is a basic example which shows you how to use the packages:
library(IPCAPS)
BED.file <- system.file("extdata", "ipcaps_example.bed", package = "IPCAPS")
LABEL.file <- system.file("extdata", "ipcaps_example_individuals.txt.gz",
package = "IPCAPS")
my.cluster1 <- ipcaps(bed = BED.file, label.file = LABEL.file, lab.col = 2,
out = tempdir())
#> Running ... IPCAPS
#> output: /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//RtmplUe8ZJ
#> label file: /Library/Frameworks/R.framework/Versions/4.0/Resources/library/IPCAPS/extdata/ipcaps_example_individuals.txt.gz
#> label column: 2
#> threshold: 0.18
#> minimum Fst: 8e-04
#> maximum thread: 1
#> method: mix
#> bed: /Library/Frameworks/R.framework/Versions/4.0/Resources/library/IPCAPS/extdata/ipcaps_example.bed
#> minimum in group: 20
#> data type: snp
#> missing: NA
#> In preprocessing step
#> Note: the directory '/var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//RtmplUe8ZJ' is existed.
#> The result files will be saved to this directory: /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//RtmplUe8ZJ/cluster_output
#> Input data: 103 individuals, 2000 markers
#> Start calculating
#> Node 1: Start the process
#> Node 1: Loading /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//RtmplUe8ZJ/cluster_output/RData/tree.RData
#> Node 1: Loading /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//RtmplUe8ZJ/cluster_output/RData/condition.RData
#> Node 1: Loading /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//RtmplUe8ZJ/cluster_output/RData/rawdata.RData
#> Node 1: Loading /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//RtmplUe8ZJ/cluster_output/RData/node1.RData
#> Node 1: Check for splitting
#> Node 1: Reducing matrix
#> Node 1: EigenFit = 0.652486925995028, Threshold = 0.18, no. significant PCs = 3
#> Node 1: Start clustering
#> Node 1: Loading /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//RtmplUe8ZJ/cluster_output/RData/node1.RData
#> Node 1: done for clustering
#> Time difference of 0.1221311 secs
#> Node 1: Checking Fst of group 1 and group 2
#> Node 1: Checking Fst of group 1 and group 3
#> Node 1: Checking Fst of group 1 and group 4
#> Node 1: Checking Fst of group 2 and group 3
#> Node 1: Checking Fst of group 2 and group 4
#> Node 1: Checking Fst of group 3 and group 4
#> Node 1 times took 0.350982904434204 secs
#> Node 1: 103 individuals were splitted into: 4 groups
#> Node 1: Return status 0
#> Node 1: Split to sub-nodes
#> Node 1: Saving /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//RtmplUe8ZJ/cluster_output/RData/node2.RData
#> Node 1: Saving /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//RtmplUe8ZJ/cluster_output/RData/node3.RData
#> Node 1: Saving /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//RtmplUe8ZJ/cluster_output/RData/node4.RData
#> Node 1: Saving /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//RtmplUe8ZJ/cluster_output/RData/node5.RData
#> Node 1: Loading /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//RtmplUe8ZJ/cluster_output/RData/tree.RData
#> Node 1: Updating /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//RtmplUe8ZJ/cluster_output/RData/tree.RData
#> Node 1: Done!
#> Time difference of 0.6124501 secs
#> Node 2: Start the process
#> Node 2: Loading /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//RtmplUe8ZJ/cluster_output/RData/tree.RData
#> Node 2: Loading /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//RtmplUe8ZJ/cluster_output/RData/condition.RData
#> Node 2: Loading /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//RtmplUe8ZJ/cluster_output/RData/rawdata.RData
#> Node 2: Loading /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//RtmplUe8ZJ/cluster_output/RData/node2.RData
#> Node 2: Check for splitting
#> Node 2: A number of node is lower than the minimum number (20), therefore split was not performed
#> Node 2: Return status 1
#> Node 2: No split was performed, Status=1
#> Node 2: Updating /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//RtmplUe8ZJ/cluster_output/RData/leafnode.RData
#> Node 2: Loading /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//RtmplUe8ZJ/cluster_output/RData/tree.RData
#> Node 2: Updating /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//RtmplUe8ZJ/cluster_output/RData/tree.RData
#> Node 2: Done!
#> Time difference of 0.03211594 secs
#> Node 3: Start the process
#> Node 3: Loading /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//RtmplUe8ZJ/cluster_output/RData/tree.RData
#> Node 3: Loading /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//RtmplUe8ZJ/cluster_output/RData/condition.RData
#> Node 3: Loading /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//RtmplUe8ZJ/cluster_output/RData/rawdata.RData
#> Node 3: Loading /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//RtmplUe8ZJ/cluster_output/RData/node3.RData
#> Node 3: Check for splitting
#> Node 3: A number of node is lower than the minimum number (20), therefore split was not performed
#> Node 3: Return status 1
#> Node 3: No split was performed, Status=1
#> Node 3: Updating /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//RtmplUe8ZJ/cluster_output/RData/leafnode.RData
#> Node 3: Loading /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//RtmplUe8ZJ/cluster_output/RData/tree.RData
#> Node 3: Updating /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//RtmplUe8ZJ/cluster_output/RData/tree.RData
#> Node 3: Done!
#> Time difference of 0.03001809 secs
#> Node 4: Start the process
#> Node 4: Loading /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//RtmplUe8ZJ/cluster_output/RData/tree.RData
#> Node 4: Loading /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//RtmplUe8ZJ/cluster_output/RData/condition.RData
#> Node 4: Loading /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//RtmplUe8ZJ/cluster_output/RData/rawdata.RData
#> Node 4: Loading /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//RtmplUe8ZJ/cluster_output/RData/node4.RData
#> Node 4: Check for splitting
#> Node 4: A number of node is lower than the minimum number (20), therefore split was not performed
#> Node 4: Return status 1
#> Node 4: No split was performed, Status=1
#> Node 4: Updating /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//RtmplUe8ZJ/cluster_output/RData/leafnode.RData
#> Node 4: Loading /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//RtmplUe8ZJ/cluster_output/RData/tree.RData
#> Node 4: Updating /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//RtmplUe8ZJ/cluster_output/RData/tree.RData
#> Node 4: Done!
#> Time difference of 0.02864623 secs
#> Node 5: Start the process
#> Node 5: Loading /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//RtmplUe8ZJ/cluster_output/RData/tree.RData
#> Node 5: Loading /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//RtmplUe8ZJ/cluster_output/RData/condition.RData
#> Node 5: Loading /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//RtmplUe8ZJ/cluster_output/RData/rawdata.RData
#> Node 5: Loading /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//RtmplUe8ZJ/cluster_output/RData/node5.RData
#> Node 5: Check for splitting
#> Node 5: Reducing matrix
#> Node 5: EigenFit = 0.343228072167861, Threshold = 0.18, no. significant PCs = 3
#> Node 5: Start clustering
#> Node 5: Loading /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//RtmplUe8ZJ/cluster_output/RData/node5.RData
#> Node 5: done for clustering
#> Time difference of 0.1035459 secs
#> Node 5: Checking Fst of group 1 and group 2
#> Node 5 times took 0.153268098831177 secs
#> Node 5: 100 individuals were splitted into: 2 groups
#> Node 5: Return status 0
#> Node 5: Split to sub-nodes
#> Node 5: Saving /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//RtmplUe8ZJ/cluster_output/RData/node6.RData
#> Node 5: Saving /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//RtmplUe8ZJ/cluster_output/RData/node7.RData
#> Node 5: Loading /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//RtmplUe8ZJ/cluster_output/RData/tree.RData
#> Node 5: Updating /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//RtmplUe8ZJ/cluster_output/RData/tree.RData
#> Node 5: Done!
#> Time difference of 0.3028221 secs
#> Node 6: Start the process
#> Node 6: Loading /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//RtmplUe8ZJ/cluster_output/RData/tree.RData
#> Node 6: Loading /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//RtmplUe8ZJ/cluster_output/RData/condition.RData
#> Node 6: Loading /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//RtmplUe8ZJ/cluster_output/RData/rawdata.RData
#> Node 6: Loading /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//RtmplUe8ZJ/cluster_output/RData/node6.RData
#> Node 6: Check for splitting
#> Node 6: Reducing matrix
#> Node 6: EigenFit = 0.0221170820186458, Threshold = 0.18, no. significant PCs =
#> Node 6: No split was performed because EigenFit is lower than threshold
#> Node 6: Return status 1
#> Node 6: No split was performed, Status=1
#> Node 6: Updating /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//RtmplUe8ZJ/cluster_output/RData/leafnode.RData
#> Node 6: Loading /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//RtmplUe8ZJ/cluster_output/RData/tree.RData
#> Node 6: Updating /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//RtmplUe8ZJ/cluster_output/RData/tree.RData
#> Node 6: Done!
#> Time difference of 0.1033342 secs
#> Node 7: Start the process
#> Node 7: Loading /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//RtmplUe8ZJ/cluster_output/RData/tree.RData
#> Node 7: Loading /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//RtmplUe8ZJ/cluster_output/RData/condition.RData
#> Node 7: Loading /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//RtmplUe8ZJ/cluster_output/RData/rawdata.RData
#> Node 7: Loading /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//RtmplUe8ZJ/cluster_output/RData/node7.RData
#> Node 7: Check for splitting
#> Node 7: Reducing matrix
#> Node 7: EigenFit = 0.0367738545252401, Threshold = 0.18, no. significant PCs =
#> Node 7: No split was performed because EigenFit is lower than threshold
#> Node 7: Return status 1
#> Node 7: No split was performed, Status=1
#> Node 7: Updating /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//RtmplUe8ZJ/cluster_output/RData/leafnode.RData
#> Node 7: Loading /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//RtmplUe8ZJ/cluster_output/RData/tree.RData
#> Node 7: Updating /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//RtmplUe8ZJ/cluster_output/RData/tree.RData
#> Node 7: Done!
#> Time difference of 0.1264 secs
#> In post process step
#> Exporting node 2 as group 1
#> Exporting node 3 as group 2
#> Exporting node 4 as group 3
#> Exporting node 6 as group 4
#> Exporting node 7 as group 5
#> Note: save as /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//RtmplUe8ZJ/cluster_output/groups.txt
#> The result files were saved at: /var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//RtmplUe8ZJ/cluster_output
#> Total runtime is 2.32541418075562 sec
The function ipcaps
does unsupervised clusering, and here is the
result:
table(my.cluster1$cluster$label, my.cluster1$cluster$group)
#>
#> 1 2 3 4 5
#> outlier3 1 1 1 0 0
#> pop1 0 0 0 50 0
#> pop2 0 0 0 0 50
The output directory will be indicated in the console or in
my.cluster1$output.dir
. All result files are saved at:You can naviage
to check the html
visualizations in the output directory.
print(my.cluster1$output.dir)
#> [1] "/var/folders/sp/hhmj9xvx53z4g4dktf5f503r0000gp/T//RtmplUe8ZJ/cluster_output"
list.files(my.cluster1$output.dir)
#> [1] "groups.txt" "images"
#> [3] "RData" "tree_scatter_cluster.html"
#> [5] "tree_scatter_label.html" "tree_scree.html"
#> [7] "tree_text.html"