IHT

A Julia module that implements the (normalized) iterative hard thresholding algorithm (IHT) of Blumensath and Davies. IHT performs feature selection akin to LASSO- or MCP-penalized regression using a greedy selection approach.

Installation

IHT.jl is not registered in METADATA. It depends on other unregistered packages:

  1. PLINK.jl
  2. SnpArrays.jl
  3. Search.jl
  4. MendelBase.jl

At the Julia REPL, execute

Pkg.clone("https://github.com/klkeys/PLINK.jl.git")
Pkg.clone("https://github.com/OpenMendel/SnpArrays.jl.git")
Pkg.clone("https://github.com/OpenMendel/Search.jl.git")
Pkg.clone("https://github.com/OpenMendel/MendelBase.jl.git")
Pkg.clone("https://github.com/klkeys/IHT.jl.git")

The order of installation is important!

Basic use

Given a data matrix x, a continuous response y, and a number k of desired predictors, we run IHT with the simple command

output = L0_reg(x, y, k)

Here output is an IHTResults container object with the following fields:

  • loss is the optimal loss function value (minimum residual sum of squares)
  • iter is the number of iterations until convergence
  • time is the time spent in computations
  • beta is the vector of the optimal statistical model.
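
The nonzero entries of beta identify the selected predictors. A minimal sketch of recovering their indices, using a dummy coefficient vector in place of a real output.beta:

```julia
# Dummy coefficient vector standing in for `output.beta`;
# with a real IHTResults object, substitute `output.beta` below.
beta = [0.0, 1.2, 0.0, -0.4, 0.0]

# indices of the selected predictors (nonzero coefficients)
bidx = [i for i in eachindex(beta) if beta[i] != 0]
```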

IHT.jl also facilitates crossvalidation for the best model size. We perform q-fold crossvalidation via

cv_output = cv_iht(x, y)

where cv_output is an IHTCrossvalidationResults container object with the following fields:

  • mses contains the mean squared errors for each model size.
  • b contains the estimated coefficients at the optimal statistical model.
  • bidx contains the indices of the predictors in the best crossvalidated model.
  • k is the best crossvalidated model size.
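
The best model size is simply the entry of path whose mean squared error is smallest. A hedged sketch of that selection rule, with dummy values standing in for cv_output.mses and the tested path:

```julia
# Dummy crossvalidation results; substitute `cv_output.mses` and the
# actual `path` argument in real use.
path = collect(1:5)
mses = [4.0, 2.5, 1.9, 2.2, 3.1]

# findmin returns (minimum value, index of minimum)
mse_min, i = findmin(mses)
k_best = path[i]
```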

Important optional arguments to cv_iht include:

  • path, an Int vector containing the path of model sizes to test. Defaults to collect(1:p), with p = size(x,2).
  • q, the number of crossvalidation folds to use. The default depends on the Julia variable Sys.CPU_CORES, but it usually equals one of 3, 4, or 5.
  • folds, a DenseVector{Int} object to assign data to each fold.
  • pids, the Int vector of process IDs to which we distribute computations. The default is to include all available processes via pids = procs().
  • refit, a Bool to determine whether or not to refit the model. The default is refit = true.

To fix the fold for each element of y, the user must pass a prespecified Int vector to folds. For example, to perform 3-fold crossvalidation on a vector y with 30 elements, folds should be a 30-element vector with ten 1s, ten 2s, and ten 3s:

n = 30                 # number of observations
q = 3                  # number of folds
k = div(n, q)          # observations per fold
y = randn(n)
folds = ones(Int, n)   # first k elements belong to fold 1
folds[k+1:2k] .= 2     # next k elements belong to fold 2
folds[2k+1:end] .= 3   # remaining elements belong to fold 3

IHT.jl can perform simple random assignment of folds using a subroutine from RegressionTools.jl:

folds = RegressionTools.cv_get_folds(y, q) # use the phenotype vector...
folds = RegressionTools.cv_get_folds(n, q) # ...or use the length of that vector

GWAS

IHT.jl interfaces with PLINK.jl to enable feature selection over GWAS data in PLINK binary format. The interface to L0_reg is unchanged:

output = L0_reg(x::BEDFile, y, k)

However, the call to cv_iht changes in order to accommodate parallel computing with SharedArrays. The genotype data must be stored in PLINK binary format. The phenotype vector must be stored as a standard binary file. In contrast, the matrix of nongenetic covariates is read from a delimited text file. Instead of using variables from the Julia REPL, cv_iht uses file paths to the data stored on disk:

cv_output = cv_iht("xfile.bed", "covfile.txt", "yfile.bin")

Here xfile points to the genotype data, covfile points to the covariates, and yfile points to the response vector. Note the file extensions! They are important for correct initialization of the data. See the documentation of PLINK.jl for details about the BEDFile object.

Open Mendel users

IHT.jl also runs GWAS using a control file as input, similar to all other Open Mendel packages. Run MendelIHT using the following command:

using IHT
result = MendelIHT("gwas 1 Control.txt")

VERY IMPORTANT: The current implementation of MendelIHT assumes that there are no missing genotypes, since it uses linear algebra functions defined in SnpArrays.jl. Therefore, you must impute missing genotypes before you use MendelIHT. SnpArrays.jl offers a naive imputation strategy, but otherwise we recommend using Option 23 of Mendel.

Input files for MendelIHT

The MendelIHT analysis package uses the following input files. Example input files can be found in the test subfolder.

  • Control File: Specifies the names of your data input and output files and any optional parameters (keywords) for the analysis.
  • Pedigree File: Gives information about your individuals, such as name, parental information, and sex.
  • SNP Definition File: Defines your SNPs with information such as SNP name, chromosome, position, and allele names.
  • SNP Data File: Holds the genotypes for your data set and must be a standard binary PLINK BED file in SNP major format. If you have a SNP data file, you must also have a SNP definition file.
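
As a rough illustration, a control file is a plain-text list of keyword = value pairs pointing at the other input files. The keyword names and file names below are hypothetical; consult the example control file in the test subfolder for the authoritative keywords:

```
# Hypothetical control file sketch -- keyword and file names illustrative only
pedigree_file = pedigree.fam
snpdefinition_file = snps.bim
snpdata_file = genotypes.bed
output_file = iht_output.txt
```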

GPU acceleration

IHT.jl interfaces with the GPU accelerator from PLINK.jl. The GPU accelerator farms the calculation of the gradient to a GPU, which greatly improves computational performance. L0_reg needs to know where the GPU kernels are stored. PLINK.jl preloads the 64-bit kernels into a variable PLINK.gpucode64.

output = L0_reg(x::BEDFile, y, k, PLINK.gpucode64)

PLINK.jl ships with kernels for Float32 (32-bit) and Float64 (64-bit) arrays. The corresponding kernel files are housed in PLINK.gpucode64 and PLINK.gpucode32. These are the only kinds of arrays supported by IHT.jl. Use of Float32 arithmetic yields faster execution times but may suffer from numerical underflow. IHT.jl enforces the same precision for both x and y; mixing single- and double-precision arrays is not allowed.
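
Because mixed precision is not allowed, both x and y must share one element type before fitting. A minimal sketch of demoting dense arrays to single precision (genotype data in a BEDFile is managed by PLINK.jl and is not converted this way):

```julia
# Dummy dense data; real analyses would use the actual design matrix
# and response vector.
x = randn(10, 5)
y = randn(10)

# elementwise conversion so both inputs use Float32 arithmetic
x32 = Float32.(x)
y32 = Float32.(y)
```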

IHT.jl performs crossvalidation with GPUs via

cv_output = cv_iht(xfile, covfile, yfile, PLINK.gpucode64)

Crossvalidation with GPUs is a complicated topic. To summarize, IHT farms an entire copy of the data to each host process. For IHT.jl, a host process corresponds to a single crossvalidation fold, so q-fold crossvalidation will usually entail q processes. OpenCL memory constraints dictate that each process should have its own copy of the data on the GPU device. NOTA BENE: IHT.jl currently makes no effort to ensure that the GPU contains sufficient memory for q copies of the data. Users must consider device memory limits when calling cv_iht with GPUs. Exceeding device memory can yield cryptic OpenCL errors regarding unallocatable buffers.
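
As a rough guide (IHT.jl itself performs no such check), one can estimate whether q copies of the PLINK-compressed genotypes fit in device memory, assuming the .bed layout of four genotypes per byte with each SNP's column padded to a byte boundary:

```julia
# Back-of-envelope GPU memory estimate for q copies of a .bed file.
# Numbers below are illustrative, not from any real data set.
n, p, q = 100_000, 500_000, 5

# PLINK .bed packs 4 genotypes per byte, one byte-padded column per SNP
bytes_per_copy = ceil(Int, n / 4) * p
total_gb = q * bytes_per_copy / 2^30

gpu_mem_gb = 16                    # hypothetical device capacity
fits = total_gb <= gpu_mem_gb      # false here: ~58 GB needed
```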
