In [1]:
from helpers.utilities import *
%run helpers/notebook_setup.ipynb

While attempting to compare limma's results for log-transformed and non-transformed data, it was noticed (and brought up by Dr Tim) that the logFC values produced by limma for non-transformed data are of the wrong order of magnitude.

I have investigated the issue, following the limma calculations for non-transformed data step by step:

In [2]:
indexed_by_target_path = 'data/clean/protein/indexed_by_target.csv'
clinical_path = 'data/clean/protein/clinical_data_ordered_to_match_proteins_matrix.csv'

In [3]:
clinical = read_csv(clinical_path, index_col=0)
raw_protein_matrix = read_csv(indexed_by_target_path, index_col=0)

In [4]:
by_condition = clinical.Meningitis

In [5]:
tb_lysozyme = raw_protein_matrix[
    raw_protein_matrix.columns[by_condition == 'Tuberculosis']
].loc['Lysozyme'].mean()

In [6]:
hc_lysozyme = raw_protein_matrix[
    raw_protein_matrix.columns[by_condition == 'Healthy control']
].loc['Lysozyme'].mean()

In [7]:
tb_lysozyme / hc_lysozyme

3.14204409697397

In [8]:
tb_lysozyme

90648.31153846152

In [9]:
hc_lysozyme

28850.108000000004

For comparison, the same group means after log10 transformation:

In [10]:
from numpy import log10

In [11]:
log10(tb_lysozyme)

4.957359719117039

In [12]:
log10(hc_lysozyme)

4.460147443270476

In [13]:
log10(tb_lysozyme) / log10(hc_lysozyme)

1.1114788876759585

In [14]:
protein_matrix = raw_protein_matrix.apply(log10)

In [15]:
%%R -i protein_matrix -i by_condition
import::here(space_to_dot, dot_to_space, .from='helpers/utilities.R')
import::here(
    limma_fit, limma_diff_ebayes, full_table,
    design_from_conditions, calculate_means,
    .from='helpers/differential_expression.R'
)

diff_ebayes = function(a, b, data=protein_matrix, conditions_vector=by_condition, ...) {
    limma_diff_ebayes(a, b, data=data, conditions_vector=conditions_vector, ...)
}

In [16]:
%%R -o tb_all_proteins_raw -i raw_protein_matrix
result = diff_ebayes('Tuberculosis', 'Healthy control', data=raw_protein_matrix)
tb_all_proteins_raw = full_table(result)

In [17]:
%%R
head(full_table(result, coef=1))

                              logFC   AveExpr        t      P.Value
Lysozyme                   61798.20  67997.26 12.34222 3.414236e-20
TIMP-1                     65320.26  89128.78 11.82749 3.121111e-19
IGFBP-4                   104840.02 186800.74 11.56193 9.882769e-19
C3d                       124850.49  99248.92 11.43494 1.719287e-18
Cyclophilin A             130136.76 117191.29 11.15072 5.970601e-18
14-3-3 protein zeta/delta 141689.40 105857.89 10.58352 7.404860e-17
                             adj.P.Val         B                   protein
Lysozyme                  4.455578e-17 -4.254329                  Lysozyme
TIMP-1                    2.036525e-16 -4.264678                    TIMP-1
IGFBP-4                   4.299004e-16 -4.270296                   IGFBP-4
C3d                       5.609172e-16 -4.273051                       C3d
Cyclophilin A             1.558327e-15 -4.279385             Cyclophilin A
14-3-3 protein zeta/delta 1.509659e-14 -4.292907 14-3-3 protein zeta/delta

In [18]:
%%R
# logFC is taken from the coefficient of the fit (result):
# it seems that the coefficients do not represent the FC as would be expected...
result$coefficients['Lysozyme', ]

[1] 61798.2


We can trace it back to:

In [19]:
%%R
fit = limma_fit(
    data=raw_protein_matrix, conditions_vector=by_condition,
    a='Tuberculosis', b='Healthy control'
)

In [20]:
%%R
fit$coefficients['Lysozyme', ]

[1] 61798.2


The coefficients change when only the TB and HC data are used, though the values remain large:

In [21]:
%%R
fit = limma_fit(
    data=raw_protein_matrix, conditions_vector=by_condition,
    a='Tuberculosis', b='Healthy control', use_all=F
)

In [22]:
%%R
fit$coefficients['Lysozyme', ]

Intercept     Group 
 59749.21  30899.10 
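One way to read these two coefficients is sketched below. It assumes, without having checked `limma_fit`, that with `use_all=F` the two groups are coded as +1 (Tuberculosis) and -1 (Healthy control); the variable names are illustrative only. Under that assumption the intercept is the midpoint of the two group means and `Group` is half their difference:

```python
import math

# Coefficients printed by R for Lysozyme (copied from the output above).
intercept = 59749.21
group = 30899.10

# Group means computed directly in pandas earlier in the notebook.
tb_lysozyme = 90648.31153846152
hc_lysozyme = 28850.108000000004

# ASSUMPTION: +1/-1 coding of the two groups (not verified in the helper).
tb_from_fit = intercept + group
hc_from_fit = intercept - group

print(tb_from_fit, hc_from_fit)
print(math.isclose(tb_from_fit, tb_lysozyme, abs_tol=0.05))
print(math.isclose(hc_from_fit, hc_lysozyme, abs_tol=0.05))
```

If the coding assumption holds, twice the `Group` coefficient is again the plain difference of means, not a ratio.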


Getting back to the previous version, we can see that the means are correctly calculated (the intercept plus a group's coefficient gives that group's mean):

In [23]:
%%R
design <- design_from_conditions(by_condition)
fit <- calculate_means(raw_protein_matrix, design)

In [24]:
%%R
fit$coefficients['Lysozyme', ]

    (Intercept) Healthy.control    Tuberculosis           Viral 
       84617.54       -55767.43         6030.77       -17925.30 


In [25]:
tb_lysozyme, hc_lysozyme

(90648.31153846152, 28850.108000000004)
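The check can be spelled out as a small sketch. It assumes that the design produced by `design_from_conditions` makes each group coefficient an offset from the intercept, which is consistent with the numbers printed above but was not verified in the helper itself:

```python
import math

# Coefficients printed by R for Lysozyme (copied from the output above).
coefficients = {
    "(Intercept)": 84617.54,
    "Healthy.control": -55767.43,
    "Tuberculosis": 6030.77,
    "Viral": -17925.30,
}

# Group means computed directly in pandas.
tb_lysozyme = 90648.31153846152
hc_lysozyme = 28850.108000000004

# ASSUMPTION: group mean = intercept + group coefficient.
tb_from_fit = coefficients["(Intercept)"] + coefficients["Tuberculosis"]
hc_from_fit = coefficients["(Intercept)"] + coefficients["Healthy.control"]

print(math.isclose(tb_from_fit, tb_lysozyme, abs_tol=0.05))
print(math.isclose(hc_from_fit, hc_lysozyme, abs_tol=0.05))
```

Both comparisons agree to within the rounding of the printed coefficients.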

In [26]:
%%R
contrast_specification <- paste(
    space_to_dot('Tuberculosis'),
    space_to_dot('Healthy control'),
    sep='-'
)
contrast.matrix <- limma::makeContrasts(contrasts=contrast_specification, levels=design)
contrast.matrix

                 Contrasts
Levels            Tuberculosis-Healthy.control
  Intercept                                  0
  Healthy.control                           -1
  Tuberculosis                               1
  Viral                                      0


There is only one more step:

> fit <- limma::contrasts.fit(fit, contrast.matrix)

so the problem must be here:

In [27]:
%%R
fit_contrasted <- limma::contrasts.fit(fit, contrast.matrix)
fit_contrasted$coefficients['Lysozyme', ]

[1] 61798.2


Note that the result we got, 61798.20, is exactly the difference of the two group means:

In [28]:
tb_lysozyme - hc_lysozyme

61798.203538461516
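The same number can be reproduced by hand from the printed contrast matrix and coefficients. This is only a sketch of the arithmetic (a weighted sum of the coefficients), not of limma's internals:

```python
import math

# Order: (Intercept), Healthy.control, Tuberculosis, Viral
coefficients = [84617.54, -55767.43, 6030.77, -17925.30]
contrast = [0, -1, 1, 0]  # Tuberculosis - Healthy.control

# Applying the contrast is a dot product with the coefficient vector.
contrasted = sum(c * w for c, w in zip(coefficients, contrast))

print(round(contrasted, 2))
print(math.isclose(contrasted, 90648.31153846152 - 28850.108000000004, abs_tol=0.05))
```

The intercept cancels out, so what remains is the Tuberculosis mean minus the Healthy control mean: a subtraction, with no division anywhere.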

In [29]:
%%R
final_fit = limma::eBayes(fit_contrasted, trend=T, robust=T)
final_fit$coefficients['Lysozyme', ]

[1] 61798.2


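This shows that limma does not compute a fold change (a ratio) at all: the reported logFC is simply the difference between the group means.

This is because it assumes that the data were log-transformed beforehand. **If we gave it log-transformed data, the difference of logs would be the log of the ratio, i.e. equivalent to division on the original scale.**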
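The identity behind the statement in bold can be checked with the Lysozyme means from above. This is a minimal sketch that uses logs of the arithmetic group means; on log-transformed input limma would work with per-sample logs, so its logFC would correspond to a ratio of geometric means and need not match these numbers exactly. By convention limma's logFC is on the log2 scale, while this notebook uses log10:

```python
import math

tb_lysozyme = 90648.31153846152
hc_lysozyme = 28850.108000000004

# Difference of logs ...
log_difference = math.log10(tb_lysozyme) - math.log10(hc_lysozyme)

# ... equals the log of the ratio, so back-transforming recovers the fold change.
fold_change = 10 ** log_difference

print(log_difference)
print(fold_change)
print(math.isclose(fold_change, tb_lysozyme / hc_lysozyme, rel_tol=1e-9))
```

The back-transformed value agrees with the ratio `tb_lysozyme / hc_lysozyme` computed at the start of the notebook (about 3.14).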