```@meta
CurrentModule = PlmDCA
```
- Pseudo-likelihood maximization for proteins.
- Learn a pseudo-likelihood model from a multiple sequence alignment.
See the [Index](@ref index) for the complete list of documented functions and types.
Protein families are given in the form of multiple sequence alignments (MSA) ``D = \{a_i^m \mid i = 1,\dots,L;\; m = 1,\dots,M\}`` of ``M`` sequences of aligned length ``L``, where each entry takes one of ``q`` possible values (the 20 amino acids plus the alignment gap).

We start from the exact expression for the conditional probability of the symbol at site ``i`` given the rest of the sequence ``a_{\setminus i}``. Here, we use the following parametrization:

```math
P(a_i \mid a_{\setminus i}) = \frac{\exp\left( h_i(a_i) + \sum_{j \neq i} J_{ij}(a_i, a_j) \right)}{z_i(a_{\setminus i})},
```

where:

```math
z_i(a_{\setminus i}) = \sum_{b=1}^{q} \exp\left( h_i(b) + \sum_{j \neq i} J_{ij}(b, a_j) \right)
```

is the normalization factor. The pseudo-likelihood maximization strategy aims at finding the values of the fields ``h_i(a)`` and couplings ``J_{ij}(a,b)`` that maximize the sum of the log conditional likelihoods over all sites and all sequences of the MSA:

```math
\mathcal{L}(h, J) = \sum_{m=1}^{M} \sum_{i=1}^{L} \log P\left(a_i^m \mid a_{\setminus i}^m\right).
```

In machine learning, this parametrization is known as soft-max regression, a generalization of logistic regression to multi-class labels.
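As a language-agnostic illustration of the parametrization above (this is a toy sketch, not the package's Julia implementation; all sizes and values are made up), the conditional probability of each symbol at one site is a soft-max over the field plus the couplings to the other sites:

```python
import numpy as np

# Toy sketch: evaluate the soft-max conditional P(a_i | rest of sequence)
# for a small random Potts model. q, L, h, J are illustrative only.
rng = np.random.default_rng(0)
q, L = 4, 6                              # alphabet size, sequence length
h = rng.normal(size=(L, q))              # fields h_i(a)
J = rng.normal(size=(L, L, q, q))        # couplings J_ij(a, b)
J = (J + J.transpose(1, 0, 3, 2)) / 2    # enforce J_ij(a,b) = J_ji(b,a)
for i in range(L):
    J[i, i] = 0.0                        # no self-couplings

def conditional(seq, i):
    """Return P(a_i = a | all other sites) for every symbol a."""
    energies = h[i].copy()               # h_i(a)
    for j in range(L):
        if j != i:
            energies += J[i, j][:, seq[j]]  # + sum over j != i of J_ij(a, a_j)
    w = np.exp(energies - energies.max())   # numerically stable soft-max
    return w / w.sum()                      # divide by z_i

seq = rng.integers(0, q, size=L)
p = conditional(seq, 2)
print(p, p.sum())
```

By construction the returned vector is a proper probability distribution over the ``q`` symbols: positive entries summing to one.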
The typical pipeline to use the package is:

- Compute the PlmDCA parameters from a multiple sequence alignment:

```
julia> res = plmdca(filefasta; kwds...)
```

To fully exploit multicore parallel computation, Julia should be invoked with:

```
$ julia -t nthreads # set nthreads to the number of cores you want to use
```

If you want to set the number of threads permanently to the desired value, you can create a default environment variable by adding

```
export JULIA_NUM_THREADS=24
```

to your `.bashrc`. More information can be found in the multi-threading section of the Julia manual.
The package is registered in the General Registry and can be installed from the package manager with

```
pkg> add PlmDCA
```

and loaded with

```
julia> using PlmDCA
```
There are two different learning strategies:

- The asymmetric one, invoked by the `plmdca_asym` method (the `plmdca` method points to the asymmetric strategy).
- The symmetric one, invoked by the `plmdca_sym` method. This method is slower and typically less accurate.
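In the asymmetric strategy described in the plmDCA literature, each site's regression is solved independently, so every pair ``(i,j)`` gets two coupling estimates that are then averaged into a single symmetric coupling. A hedged sketch of that post-processing step (not the package's code; `Jasym` and its contents are made up for illustration):

```python
import numpy as np

# Sketch of the symmetrization step used by asymmetric plmDCA variants:
# Jasym[i, j] is the q x q coupling block between sites i and j as
# estimated from site i's independent regression.
rng = np.random.default_rng(1)
q, L = 3, 5
Jasym = rng.normal(size=(L, L, q, q))

def symmetrize(Jasym):
    J = np.zeros_like(Jasym)
    for i in range(L):
        for j in range(L):
            if i != j:
                # average site i's view of (i,j) with site j's view of (j,i)
                J[i, j] = (Jasym[i, j] + Jasym[j, i].T) / 2
    return J

J = symmetrize(Jasym)
print(np.allclose(J[1, 3], J[3, 1].T))
```

After this step the couplings satisfy ``J_{ij}(a,b) = J_{ji}(b,a)``, as required of a well-defined pairwise model.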
Both methods return the parameters in `res::PlmOut`, a struct containing:

- `Jtensor`: the 4-dimensional Array (q×q×L×L) of the couplings in zero-sum gauge `J[a,b,i,j]`
- `Htensor`: the 2-dimensional Array (q×L) of the fields in zero-sum gauge `h[a,i]`
- `pslike`: a vector of length `L` containing the log-pseudolikelihoods
- `score`: a vector of `(i,j,val)::Tuple{Int,Int,Float64}` containing the DCA score `val` relative to the `i,j` pair, in descending order.
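The zero-sum gauge mentioned above fixes the gauge freedom of the Potts parameters so that each coupling block sums to zero along every row and column. A minimal sketch of that transformation on a single ``q \times q`` block (illustrative only; the random block is made up and this is not the package's code):

```python
import numpy as np

# Zero-sum gauge on one q x q coupling block J_ij: subtract row and
# column means and add back the grand mean, so every row and column
# of the transformed block sums to zero.
rng = np.random.default_rng(2)
q = 4
Jij = rng.normal(size=(q, q))

def zero_sum_gauge(Jij):
    return (Jij
            - Jij.mean(axis=0, keepdims=True)   # subtract column means
            - Jij.mean(axis=1, keepdims=True)   # subtract row means
            + Jij.mean())                       # add back the grand mean

Jzs = zero_sum_gauge(Jij)
print(np.allclose(Jzs.sum(axis=0), 0), np.allclose(Jzs.sum(axis=1), 0))
```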
For both methods, the keyword arguments (with their default values) are:

- `epsconv::Real=1.0e-5` (convergence parameter)
- `maxit::Int=1000` (maximum number of iterations; normally there is no need to change it)
- `verbose::Bool=true` (set to `false` to suppress printing on screen)
- `method::Symbol=:LD_LBFGS` (optimization method)
For example:

```
julia> res = plmdca("data/PF14/PF00014_mgap6.fasta.gz", verbose=false, lambdaJ=0.02, lambdaH=0.001);
```
Contact prediction is contained in the output of `plmdca`: the returned `PlmOut` struct contains a `score` field, which is a `Vector` of `Tuple`s ranked in descending score order. Each `Tuple` contains the indices `i`, `j` of the residue pair and the associated score `val`.
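To illustrate how such a ranked score list can be derived from a coupling tensor, here is a hedged sketch that ranks residue pairs by the Frobenius norm of their coupling block, in descending order. This mimics the `(i, j, val)` tuples described above but is not the package's code (real plmDCA scores typically also include a correction such as APC, which is omitted here):

```python
import numpy as np

# Build a descending (i, j, val) score list from a coupling tensor
# J[a, b, i, j] (random toy data) by taking the Frobenius norm of each
# pair's q x q coupling block.
rng = np.random.default_rng(3)
q, L = 3, 5
J = rng.normal(size=(q, q, L, L))   # same index convention as Jtensor

score = sorted(
    ((i, j, float(np.linalg.norm(J[:, :, i, j])))
     for i in range(L) for j in range(i + 1, L)),
    key=lambda t: t[2],
    reverse=True,
)
print(score[0])   # the pair with the strongest coupling
```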
```@autodocs
Modules = [PlmDCA]
```