# Information API

- [Types](#Types)
    - [Pseudocounts](#Pseudocounts)
    - [Contingency tables](#Contingency-tables)
    - [Information measures](#Information-measures)
    - [Imported from MIToS.MSA](#Imported-from-MIToS.MSA)
- [Constants](#Constants)
- [Methods](#Methods)
    - [Working with Contingency tables](#Working-with-Contingency-tables)
    - [Estimate](#Estimate)
    - [Measures](#Measures)
    - [Imported from Base](#Imported-from-Base)

In [1]:
using MIToS.Information



In [2]:
?MIToS.Information

The `Information` module of MIToS defines types and functions for calculating information measures (e.g. *Mutual Information* (MI) and *Entropy*) over a Multiple Sequence Alignment (MSA). The module was designed to count `Residue`s (defined in the `MSA` module) in special contingency tables as fast as possible and to derive probabilities from these counts. It also includes methods for applying corrections to those tables, e.g. pseudocounts and pseudo frequencies. Finally, `Information` lets you use these probabilities and counts to estimate information measures and other frequency-based values.

**Features**

  * Estimate multidimensional frequency and probability tables from sequences, MSAs, etc.
  * Correction for a small number of observations
  * Correction for data redundancy in an MSA
  * Estimate information measures
  * Calculate corrected mutual information between residues

```julia
using MIToS.Information
```


<div class="panel panel-info">
    <div class="panel-heading">
        <strong>Julia help mode</strong>
    </div>
    <div class="panel-body">
        <p>If you type <code>?</code> at the beginning of a Julia REPL line, you enter Julia help mode. In this mode, Julia prints the help or <strong>documentation</strong> of the entered element. This is a convenient way to get information about MIToS functions, types, etc. from within Julia.</p>
    </div>
</div>

<a href="#"><i class="fa fa-arrow-up"></i></a>

## Types

In [3]:
?MIToS.Information.SequenceWeights

`SequenceWeights` is an alias for `Union{ClusteringResult, AbstractVector}`.

This is the type of the keyword argument `weight` of `count`, `probabilities` and derived functions. The type should define `getweight` to be useful in those functions.


<a href="#"><i class="fa fa-arrow-up"></i></a>

### Pseudocounts

In [4]:
Docs.typesummary( MIToS.Information.Pseudocount )



**Summary:**

```julia
abstract MIToS.Information.Pseudocount{T<:Real} <: Any
```

**Subtypes:**

```julia
MIToS.Information.AdditiveSmoothing{T<:Real}
```


In [5]:
?MIToS.Information.AdditiveSmoothing

**Additive Smoothing**, or fixed pseudocount `λ`, for `ResidueCount` (used to estimate probabilities when the number of samples is low).

Common values of `λ` are:

  * `0` : No cell frequency prior; gives you the maximum likelihood estimator.
  * `0.05` : The optimum value for `λ` found in Buslje et al. 2009; similar results were obtained for `λ` in the range [0.025, 0.075].
  * `1 / p` : Perks prior (Perks, 1947), where `p` is the number of parameters (i.e. residues, pairs of residues) to estimate. If `p` is the number of residues (`20` without counting gaps), this gives you `0.05`.
  * `sqrt(n) / p` : Minimax prior (Trybula, 1958), where `n` is the number of samples and `p` the number of parameters to estimate. If the number of samples `n` is 400 (the minimum number of sequence clusters needed to achieve good performance in Buslje et al. 2009) for estimating 400 parameters (pairs of residues without counting gaps), this gives you `0.05`.
  * `0.5` : Jeffreys prior (Jeffreys, 1946).
  * `1` : Bayes-Laplace uniform prior, aka. Laplace smoothing.
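
As a rough illustration of how `λ` enters the estimate, the sketch below applies additive smoothing to a plain vector of counts. It uses only Base Julia (1.x syntax) and does not touch MIToS types; `smoothed_probabilities`, `perks_prior` and `minimax_prior` are hypothetical helper names introduced here, not MIToS functions.

```julia
# Additive smoothing on a plain count vector: p_i = (n_i + λ) / (N + λ * k),
# where N is the total count and k the number of cells.
# Hypothetical helpers for illustration only; not part of MIToS.
function smoothed_probabilities(counts::AbstractVector{<:Real}, λ::Real)
    total = sum(counts) + λ * length(counts)
    return [(n + λ) / total for n in counts]
end

perks_prior(p::Integer) = 1 / p                       # Perks prior: 1 / p
minimax_prior(n::Integer, p::Integer) = sqrt(n) / p   # Minimax prior: sqrt(n) / p

counts = [3, 0, 1, 0]                            # toy counts for four categories
p_ml = smoothed_probabilities(counts, 0)         # λ = 0: maximum likelihood
p_laplace = smoothed_probabilities(counts, 1)    # λ = 1: Laplace smoothing
```

With `λ = 0` the empty cells keep probability zero, while any positive `λ` moves some probability mass to them.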


<a href="#"><i class="fa fa-arrow-up"></i></a>

### Contingency tables

In [6]:
Docs.typesummary( MIToS.Information.ResidueContingencyTables )

**Summary:**

```julia
abstract MIToS.Information.ResidueContingencyTables{T,N,UseGap} <: AbstractArray{T,N}
```

**Subtypes:**

```julia
MIToS.Information.ResidueCount{T,N,UseGap}
MIToS.Information.ResidueProbability{T,N,UseGap}
```


In [7]:
?MIToS.Information.ResidueCount

`ResidueCount{T, N, UseGap}` is used for counting residues in columns (or sequences) of an MSA. `N` is the dimensionality and should be an `Int`, e.g. 2 if two columns are used for counting pairs. `UseGap` is a `Bool`; `true` means that **ResidueCount** counts gaps in position 21.

  * The field `marginal` is used for preallocation of marginal sums.
  * The field `total` is used for storing the table sum.


In [8]:
?MIToS.Information.ResidueProbability

`ResidueProbability{T, N, UseGap}` is used to store residue probabilities. `N` is the dimensionality and should be an `Int`, e.g. 2 to store probabilities of residue pairs. `UseGap` is a `Bool`; `true` means that gap probabilities are stored in position 21 of each dimension.

  * The field `marginal` is used for preallocation of marginal sums.


<a href="#"><i class="fa fa-arrow-up"></i></a>

### Information measures

In [9]:
Docs.typesummary( MIToS.Information.AbstractMeasure )

**Summary:**

```julia
abstract MIToS.Information.AbstractMeasure{T} <: Any
```

**Subtypes:**

```julia
MIToS.Information.SymmetricMeasure{T}
```


In [10]:
Docs.typesummary( MIToS.Information.SymmetricMeasure )

**Summary:**

```julia
abstract MIToS.Information.SymmetricMeasure{T} <: MIToS.Information.AbstractMeasure{T}
```

**Subtypes:**

```julia
MIToS.Information.Entropy{T}
MIToS.Information.GapIntersectionPercentage{T}
MIToS.Information.GapUnionPercentage{T}
MIToS.Information.MutualInformationOverEntropy{T}
MIToS.Information.MutualInformation{T}
```


In [11]:
?MIToS.Information.Entropy

Shannon entropy (H)


In [12]:
?MIToS.Information.MutualInformation

Mutual Information (MI)


In [13]:
?MIToS.Information.MutualInformationOverEntropy

Normalized Mutual Information (nMI) by Entropy.

`nMI(X, Y) = MI(X, Y) / H(X, Y)`
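
The relation between the three measures can be sketched on a plain joint probability matrix. This is only an illustration in Base Julia (1.x syntax), using the identity `MI(X, Y) = H(X) + H(Y) - H(X, Y)`; `shannon_entropy`, `mutual_information` and `normalized_mi` are hypothetical names, not the MIToS `estimate` methods.

```julia
# Shannon entropy of a probability table; zero cells contribute nothing.
shannon_entropy(p; base = ℯ) =
    -sum(x -> x > 0 ? x * log(x) : 0.0, p) / log(base)

# MI(X, Y) = H(X) + H(Y) - H(X, Y), computed from the joint table pxy.
function mutual_information(pxy::AbstractMatrix; base = ℯ)
    px = sum(pxy, dims = 2)   # marginal of X (row sums)
    py = sum(pxy, dims = 1)   # marginal of Y (column sums)
    return shannon_entropy(px; base = base) + shannon_entropy(py; base = base) -
           shannon_entropy(pxy; base = base)
end

# nMI(X, Y) = MI(X, Y) / H(X, Y)
normalized_mi(pxy; base = ℯ) =
    mutual_information(pxy; base = base) / shannon_entropy(pxy; base = base)

pxy = [0.5 0.0; 0.0 0.5]      # X and Y perfectly correlated
nmi = normalized_mi(pxy)
```

For the perfectly correlated table, MI equals the joint entropy, so the normalized value is one; for independent variables it drops to zero.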


In [14]:
?MIToS.Information.GapUnionPercentage

`GapUnionPercentage`


In [15]:
?MIToS.Information.GapIntersectionPercentage

`GapIntersectionPercentage`


<a href="#"><i class="fa fa-arrow-up"></i></a>

### Imported from MIToS.MSA

In [16]:
?MIToS.Information.Raw

No documentation found.

**Summary:**

```julia
immutable MIToS.MSA.Raw <: MIToS.Utils.Format
```


In [17]:
?MIToS.Information.Stockholm

No documentation found.

**Summary:**

```julia
immutable MIToS.MSA.Stockholm <: MIToS.Utils.Format
```


In [18]:
?MIToS.Information.FASTA

No documentation found.

**Summary:**

```julia
immutable MIToS.MSA.FASTA <: MIToS.Utils.Format
```


<a href="#"><i class="fa fa-arrow-up"></i></a>

## Constants

In [19]:
?MIToS.Information.BLOSUM62_Pi

BLOSUM62 probabilities *P(aa)* for each residue. SUM:  0.9987


In [20]:
?MIToS.Information.BLOSUM62_Pij

Normalization is row based. The first row contains the *P(aa|A)* values, and so on.

`     A      R      N      D      C      Q      E      G      H      I      L      K      M      F      P      S      T      W      Y      V`


<a href="#"><i class="fa fa-arrow-up"></i></a>

## Methods

In [21]:
?MIToS.Information.APC!

Average Product Correction (APC) (Dunn et al. 2008)
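
The docstring gives no formula, so the sketch below assumes the usual form of the correction from Dunn et al. 2008: each pairwise score is reduced by the product of the two column means divided by the overall mean. It works on a plain symmetric matrix and ignores the diagonal; `apc` is a hypothetical helper, and the in-place `APC!` of MIToS may differ in its details.

```julia
# Average Product Correction on a symmetric score matrix, ignoring the diagonal:
# corrected[i, j] = mi[i, j] - mean_i * mean_j / overall_mean
# Hypothetical helper for illustration; not the MIToS implementation.
function apc(mi::AbstractMatrix{<:Real})
    n = size(mi, 1)
    # Mean score of each column against all the other columns
    colmean = [sum(mi[i, j] for j in 1:n if j != i) / (n - 1) for i in 1:n]
    # Mean over all off-diagonal pairs
    overall = sum(colmean) / n
    corrected = zeros(Float64, n, n)
    for i in 1:n, j in 1:n
        if i != j
            corrected[i, j] = mi[i, j] - colmean[i] * colmean[j] / overall
        end
    end
    return corrected
end

mi = [0.0 0.4 0.2;
      0.4 0.0 0.6;
      0.2 0.6 0.0]    # made-up MI values for three columns
mip = apc(mi)
```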


<a href="#"><i class="fa fa-arrow-up"></i></a>

### Working with Contingency tables

In [22]:
?MIToS.Information.probabilities

`probabilities(T, α, β, res1, res2, [weight])` uses BLOSUM62-based pseudofrequencies. α is the weight of the evidence, and β the weight of the pseudofrequencies.

```
probabilities(res::AbstractVector{Residue}...; usegap=false, weight=NoClustering())
probabilities(pseudocount::Pseudocount, res::AbstractVector{Residue}...; usegap=false, weight=NoClustering())
```

`probabilities` creates a new `ResidueProbability` with the probabilities of residues, pairs of residues, etc. in the sequences/columns.


In [23]:
?MIToS.Information.count

```
count(p, itr) -> Integer
```

Count the number of elements in `itr` for which predicate `p` returns `true`.

```
count(res::AbstractVector{Residue}...; usegap=false, weight=NoClustering())
count(pseudocount::Pseudocount, res::AbstractVector{Residue}...; usegap=false, weight=NoClustering())
```

`count` creates a new `ResidueCount` with the number of residues, pairs of residues, etc. in the sequences/columns.


In [24]:
?MIToS.Information.count!

`count!` adds counts from a vector of residues to a `ResidueCount` object. It can take a `SequenceWeights` object as second argument.

```julia
julia> using MIToS.Information

julia> using MIToS.MSA

julia> seq = Residue[ i for i in 1:20];

julia> Ni = count(seq)
20-element MIToS.Information.ResidueCount{Float64,1,false}:
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0

julia> count!(Ni, seq)
20-element MIToS.Information.ResidueCount{Float64,1,false}:
 2.0
 2.0
 2.0
 2.0
 2.0
 2.0
 2.0
 2.0
 2.0
 2.0
 2.0
 2.0
 2.0
 2.0
 2.0
 2.0
 2.0
 2.0
 2.0
 2.0

```


In [25]:
?MIToS.Information.normalize!

`normalize!(p::ResidueProbability; updated::Bool=false)`

This function makes the probabilities sum to one. By default, the sum is calculated from the `probabilities` field (the marginals are assumed not to be up to date). The marginals are updated during the normalization.

If the marginals are already up to date, you can use `updated=true` for a faster normalization.


In [26]:
?MIToS.Information.nresidues

No documentation found.

`MIToS.Information.nresidues` is a generic `Function`.

```julia
# 2 methods for generic function "nresidues":
nresidues{T,N}(n::MIToS.Information.ResidueContingencyTables{T,N,true}) at /home/diego/.julia/v0.4/MIToS/src/Information/Probabilities.jl:74
nresidues{T,N}(n::MIToS.Information.ResidueContingencyTables{T,N,false}) at /home/diego/.julia/v0.4/MIToS/src/Information/Probabilities.jl:75
```


In [27]:
?MIToS.Information.update!

Updates the marginal values (and the total value for `ResidueCount`) of a `ResidueContingencyTables` object.


In [28]:
?MIToS.Information.apply_pseudocount!

`apply_pseudocount!{T, N, UseGap}(n::ResidueCount{T, N, UseGap}, pse::AdditiveSmoothing{T})`

Uses an instance of `AdditiveSmoothing` to efficiently fill each element of the table with a constant value.


In [29]:
?MIToS.Information.blosum_pseudofrequencies!

`blosum_pseudofrequencies!(Gab::ResidueProbability{T, 2,false}, Pab::ResidueProbability{T, 2,false})`

This function uses the conditional probability matrix `BLOSUM62_Pij` to fill a preallocated `Gab` with pseudo frequencies. `blosum_pseudofrequencies!` also needs the real frequencies/probabilities `Pab`. These observed probabilities are then used to estimate the pseudo frequencies.

`Gab = Σcd  Pcd ⋅ BLOSUM62( a | c ) ⋅ BLOSUM62( b | d )`
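
The sum above can be written out directly. The sketch below uses a toy two-letter alphabet and a made-up conditional matrix in place of the 20×20 `BLOSUM62_Pij`, assuming the row-based layout `cond[c, a] = P(a | c)`; `pseudofrequencies` is a hypothetical helper in plain Julia 1.x, not the MIToS function.

```julia
# Gab = Σcd Pcd ⋅ B(a | c) ⋅ B(b | d), with cond[c, a] = P(a | c) (rows sum to one).
# Hypothetical helper on plain matrices; the real table is BLOSUM62_Pij.
function pseudofrequencies(pab::AbstractMatrix, cond::AbstractMatrix)
    n = size(pab, 1)
    gab = zeros(Float64, n, n)
    for a in 1:n, b in 1:n, c in 1:n, d in 1:n
        gab[a, b] += pab[c, d] * cond[c, a] * cond[d, b]
    end
    return gab
end

cond = [0.9 0.1;
        0.2 0.8]      # made-up conditional probabilities P(a | c)
pab = [0.7 0.0;
       0.0 0.3]       # observed pair probabilities
gab = pseudofrequencies(pab, cond)
```

Because each row of the conditional matrix sums to one, the pseudo frequencies again sum to one, and pairs that were never observed (here the off-diagonal cells) receive a non-zero value.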


In [30]:
?MIToS.Information.apply_pseudofrequencies!

`apply_pseudofrequencies!{T}(Pab::ResidueProbability{T, 2,false}, Gab::ResidueProbability{T, 2,false}, α, β)`

Applies the pseudofrequencies `Gab` to `Pab` as a weighted mean, where α is the weight of the real frequencies `Pab` and β the weight of the pseudofrequencies.

`Pab = (α ⋅ Pab + β ⋅ Gab )/(α + β)`
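
A minimal sketch of this weighted mean on plain matrices follows. The values of `pab`, `gab`, α and β are made up for the example, and `blend` is a hypothetical out-of-place helper, whereas `apply_pseudofrequencies!` modifies `Pab` in place.

```julia
# Pab ← (α ⋅ Pab + β ⋅ Gab) / (α + β), applied elementwise.
# Hypothetical helper for illustration; returns a new matrix.
blend(pab, gab, α, β) = (α .* pab .+ β .* gab) ./ (α + β)

pab = [0.7 0.0; 0.0 0.3]            # observed pair probabilities (made up)
gab = [0.579 0.111; 0.111 0.199]    # pseudo frequencies (made up)
mixed = blend(pab, gab, 3.0, 1.0)   # α = 3, β = 1
```

Since both tables sum to one, so does the result; with `β = 0` the observed probabilities are returned unchanged.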


In [31]:
?MIToS.Information.delete_dimensions!

`delete_dimensions!(out::ResidueContingencyTables, in::ResidueContingencyTables, dimensions::Int...)`

This function fills a `ResidueContingencyTables` object with the counts/probabilities of `in` after the deletion of `dimensions`. For example, this is useful for getting Pxy from Pxyz.


In [32]:
?MIToS.Information.delete_dimensions

`delete_dimensions(in::ResidueContingencyTables, dimensions::Int...)`

This function creates a `ResidueContingencyTables` object with the counts/probabilities of `in` after the deletion of `dimensions`. For example, this is useful for getting Pxy from Pxyz.
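
On a plain array, deleting a dimension amounts to summing over it. The sketch below shows this for a toy 2×2×2 table using Base Julia (1.x syntax, where `dropdims` is available); `drop_dimension` is a hypothetical name and the example does not use MIToS types.

```julia
# Pxy = Σz Pxyz: sum over the deleted dimension and drop it from the shape.
# Hypothetical helper on plain arrays; not the MIToS implementation.
drop_dimension(p::AbstractArray, d::Integer) = dropdims(sum(p, dims = d), dims = d)

pxyz = fill(1 / 8, 2, 2, 2)      # uniform toy table over (x, y, z)
pxy = drop_dimension(pxyz, 3)    # 2×2 table over (x, y)
```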


<a href="#"><i class="fa fa-arrow-up"></i></a>

### Estimate

In [33]:
?MIToS.Information.estimate

`estimate(MutualInformation(), pxy::ResidueCount [, base])`

Calculate Mutual Information from `ResidueCount`. The result type is determined by the `base`. It's the fastest option (you don't spend time on probability calculations).

`estimate(MutualInformation(), pxy::ResidueProbability [, base])`

Calculate Mutual Information from `ResidueProbability`. The result type is determined by `base`.

`estimate(Entropy{T}(base), n::ResidueCount)`

It's the fastest option (you don't spend time on probability calculations). The result type is determined by the `base`.

`estimate(Entropy(base), p)`

`p` should be a `ResidueProbability` table. The result type is determined by `base`.


In [34]:
?MIToS.Information.estimate_on_marginal

`estimate_on_marginal(Entropy{T}(base), p, marginal)`

This function estimates the entropy H(X) if `marginal` is 1, H(Y) if it is 2, etc. The result type is determined by `base`.


In [35]:
?MIToS.Information.estimateinsequences

This function estimates a `measure` over sequences or sequence pairs. It takes the same arguments as `estimateincolumns`; see the documentation of that function.


In [36]:
?MIToS.Information.estimateincolumns

`estimateincolumns(aln, [count,] use, [α, β,] measure, [pseudocount,] [weight,] [usediagonal, diagonalvalue])`

This function estimates an `AbstractMeasure` in columns or pairs of columns of an MSA.

  * `aln` : This argument is mandatory and should be a `Matrix{Residue}`. Use the function `getresidues` (from the MSA module) on an MSA object to get the needed matrix.
  * `count` : This argument is optional. It should be defined when `use` is a `ResidueProbability` object. It indicates the element type of the counting table.
  * `use` : This argument is mandatory and indicates the subtype of `ResidueContingencyTables` used by `estimate` inside the function. If the table has one dimension (`N`=`1`), the occurrences/probabilities are counted for each sequence/column. If the table has two dimensions (`N`=`2`), pairs of sequences/columns are used. The dimension `N` and the `UseGap` parameter of `ResidueCount{T, N, UseGap}` or `ResidueProbability{T, N, UseGap}` determine the output and behaviour of this function. If `UseGap` is true, gaps are used in the estimations.
  * `α` : This argument is optional and indicates the weight of the real frequencies when BLOSUM62-based pseudo frequencies are applied.
  * `β` : This argument is optional and indicates the weight of the BLOSUM62-based pseudo frequencies.
  * `measure` : This argument is mandatory and indicates the measure to be used by `estimate` inside the function.
  * `pseudocount` : This argument is optional. It should be an `AdditiveSmoothing` instance (defaults to zero).
  * `weight` : This argument is optional. It should be an instance of `ClusteringResult` or `AbstractVector` (vector of weights). Each sequence has weight 1 (`NoClustering()`) by default.
  * `usediagonal` : This function returns a `Vector` in the one-dimensional case, or a `PairwiseListMatrix` in the two-dimensional case. This argument only makes sense in the two-dimensional case and indicates whether the list in the `PairwiseListMatrix` should include the diagonal (defaults to `true`).
  * `diagonalvalue` : This argument is optional (defaults to zero). It indicates the value of the output diagonal elements.


<a href="#"><i class="fa fa-arrow-up"></i></a>

### Measures

In [37]:
?MIToS.Information.buslje09

This function takes an MSA, or a file and a `Format`, as first arguments. It calculates a Z score and a corrected MI/MIp as described in **Buslje et al. 2009**.

Argument, type, default value and descriptions:

```
  - lambda      Float64   0.05    Low count value
  - clustering  Bool      true    Sequence clustering (Hobohm I)
  - threshold             62      Percent identity threshold for clustering
  - maxgap      Float64   0.5     Maximum fraction of gaps in positions included in calculation
  - apc         Bool      true    Use APC correction (MIp)
  - usegap      Bool      false   Use gaps on statistics
  - samples     Int       100     Number of samples for Z-score
  - fixedgaps   Bool      true    Fix gaps positions for the random samples
```

This function returns:

```
  - Z score
  - MI or MIp
```
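
The docstring does not spell out how the Z score is computed. A common definition, assumed here, compares the observed value with the mean and standard deviation of the values obtained from the randomized samples. The sketch uses the `Statistics` standard library (Julia 1.x); `zscore` and the numbers are hypothetical and only illustrate the idea.

```julia
using Statistics  # mean, std

# Z score of an observed value against the scores of randomized samples.
# Assumed definition for illustration; not taken from the MIToS source.
zscore(observed::Real, randomized::AbstractVector{<:Real}) =
    (observed - mean(randomized)) / std(randomized)

randomized = [0.10, 0.12, 0.08, 0.11, 0.09]   # made-up scores from shuffled columns
z = zscore(0.25, randomized)
```

A larger number of `samples` gives more stable estimates of the mean and standard deviation, at the cost of a longer running time.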


In [38]:
?MIToS.Information.BLMI

This function takes an MSA, or a file and a `Format`, as first arguments. It calculates a Z score (ZBLMI) and a corrected MI/MIp as described in **Buslje et al. 2009**, but using BLOSUM62 pseudo frequencies instead of a fixed pseudocount.

Argument, type, default value and descriptions:

```
  - beta        Float64   8.512   β for BLOSUM62 pseudo frequencies
  - lambda      Float64   0.0     Low count value
  - threshold             62      Percent identity threshold for sequence clustering (Hobohm I)
  - maxgap      Float64   0.5     Maximum fraction of gaps in positions included in calculation
  - apc         Bool      true    Use APC correction (MIp)
  - samples     Int       50      Number of samples for Z-score
  - fixedgaps   Bool      true    Fix gaps positions for the random samples
```

This function returns:

```
  - Z score (ZBLMI)
  - MI or MIp using BLOSUM62 pseudo frequencies (BLMI/BLMIp)
```


In [39]:
?MIToS.Information.pairwisegapfraction

This function takes an MSA, or a file and a `Format`, as first arguments. It calculates the percentage of gaps in column pairs (union and intersection) using sequence clustering (Hobohm I).

Argument, type, default value and descriptions:

  * clustering  Bool      true    Sequence clustering (Hobohm I)
  * threshold             62      Percent identity threshold for sequence clustering (Hobohm I)

This function returns:

  * pairwise gap percentage (union)
  * pairwise gap percentage (intersection)
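
The two returned values can be illustrated on a single pair of columns. The sketch assumes that "union" counts sequences with a gap in either column and "intersection" counts sequences with a gap in both; it uses plain character vectors with `'-'` as the gap and no sequence weighting. `gap_percentages` is a hypothetical helper, not the MIToS implementation.

```julia
# Percentage of sequences with a gap in either column (union) or in both (intersection).
# Columns are plain character vectors here; '-' marks a gap. No clustering weights.
function gap_percentages(col1::AbstractVector{Char}, col2::AbstractVector{Char})
    length(col1) == length(col2) || throw(DimensionMismatch("columns differ in length"))
    n = length(col1)
    either = count(i -> col1[i] == '-' || col2[i] == '-', 1:n)
    both = count(i -> col1[i] == '-' && col2[i] == '-', 1:n)
    return 100 * either / n, 100 * both / n
end

col1 = ['A', '-', 'C', '-']
col2 = ['A', 'D', '-', '-']
gap_union, gap_intersection = gap_percentages(col1, col2)
```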


<a href="#"><i class="fa fa-arrow-up"></i></a>

### Imported from Base

In [40]:
?MIToS.Information.fill!

```
fill!(A, x)
```

Fill array `A` with the value `x`. If `x` is an object reference, all elements will refer to the same object. `fill!(A, Foo())` will return `A` filled with the result of evaluating `Foo()` once.

`fill!{T, N, UseGap}(p::ResidueProbability{T, N, UseGap}, n::ResidueCount{T, N, UseGap}; updated::Bool=false)`

This function fills a preallocated `ResidueProbability` (`p`) with the probabilities calculated from `n` (`ResidueCount`). This function updates `n` unless `updated=true`.

If `n` is already up to date, you can use `updated=true` for a faster calculation.

`fill!{T, N, UseGap}(n::ResidueCount{T, N, UseGap}, pse::AdditiveSmoothing{T})` fills a preallocated `ResidueCount` (`n`) with the pseudocount (`pse`).


<a href="#"><i class="fa fa-arrow-up"></i></a>