
# DA-RED-LODs
DA-RED-LODs is a biologically based feature selection method built on the Open Targets Platform [1], which provides a scored ranking that associates genes with a given disease using biological evidence from all of its integrated sources. This score is computed as the harmonic sum of the scores of each data type, which are in turn computed as the harmonic sum of the scores of each data source. Each evidence score takes into account the frequency of the evidence, the strength of the effect it describes and its confidence [2]. The proposed method is built on these association scores, the underlying evidence and the LOD score.
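
The nested harmonic sum described above can be illustrated with a minimal R sketch. It assumes the usual Open Targets formulation, in which scores are sorted in decreasing order and the i-th score is divided by i squared; the cap at 1, the function name `harmonic_sum` and all evidence values are assumptions made for illustration, not values taken from the platform.

```r
# Harmonic sum as assumed here: sort the scores in decreasing order and
# weight the i-th largest score by 1 / i^2. The cap at 1 is an assumption.
harmonic_sum <- function(scores, cap = 1) {
  if (length(scores) == 0) return(0)
  s <- sort(scores, decreasing = TRUE)
  min(cap, sum(s / seq_along(s)^2))
}

# Hypothetical evidence scores, grouped by data source within each data type.
evidence_scores <- list(
  genetics   = list(source_a = c(0.6, 0.4), source_b = c(0.2)),
  literature = list(source_c = c(0.3, 0.2, 0.1))
)

# Data source score: harmonic sum of its evidence scores.
source_scores <- lapply(evidence_scores, function(dt) sapply(dt, harmonic_sum))
# Data type score: harmonic sum of its data source scores.
type_scores <- sapply(source_scores, harmonic_sum)
# Overall association score: harmonic sum of the data type scores.
overall_score <- harmonic_sum(type_scores)
overall_score
```

Because each additional score is divided by a growing squared rank, many weak pieces of evidence add little to the total, while the strongest evidence dominates it.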

DA-RED-LOD is an iterative algorithm that follows the minimum-redundancy maximum-relevance principle in biological terms. Its functional diagram is shown in the following figure.

![**DA-RED-LODs diagram.** Overall flowchart describing the workflow of the DA-RED-LOD method.](imgs/daredfs.png)


The starting point is an empty set of selected genes, $S_G$, a set of possible genes, $G$, a specific or general disease, $D$, and, in the case of dealing with a multiclass problem, a set of pathologies, $D_P$. 

In the first step, the gene with the highest relevancy is selected and becomes the first gene added to the selected gene set $S_G$. In each following step, gene scores are calculated with the equation below, and the gene with the maximum score is selected and added to $S_G$.

\begin{equation} 
SCORE(g,S_G,D,D_P) =  Rel(g,D,D_P) - \sum_{g_i \in S_G} \frac{RED(g,g_i,D)}{|S_G|}
\end{equation}

Where $Rel(g,D,D_P)$ is the relevancy of gene $g$ with respect to the disease $D$ and the possible sub-diseases in $D_P$, and $RED(g,g_i,D)$ is the redundancy of gene $g$ over gene $g_i$ with respect to $D$. The subtracted term is therefore the mean redundancy of $g$ over the genes already selected. Relevancy and redundancy are explained in detail in the following subsections.
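
The iterative selection can be sketched in R as follows. This is a minimal illustration, not the code in `featureSelection.R`: the function name `select_genes`, its arguments and the toy values are hypothetical, and relevancy and redundancy are assumed to be precomputed.

```r
# rel: named numeric vector, rel[g] = Rel(g, D, D_P)
# red: numeric matrix named on both dimensions, red[g, g_i] = RED(g, g_i, D)
select_genes <- function(rel, red, max_genes) {
  candidates <- names(rel)
  # First step: the gene with the highest relevancy enters S_G.
  first <- candidates[which.max(rel)]
  selected <- first
  scores <- rel[first]
  candidates <- setdiff(candidates, first)

  while (length(selected) < max_genes && length(candidates) > 0) {
    # Mean redundancy of each candidate over the genes already in S_G.
    mean_red <- rowMeans(red[candidates, selected, drop = FALSE])
    score <- rel[candidates] - mean_red
    best <- candidates[which.max(score)]
    selected <- c(selected, best)
    scores <- c(scores, score[best])
    candidates <- setdiff(candidates, best)
  }
  scores  # named vector: selected genes in order, with their scores
}

# Toy example with hypothetical values: B is relevant but largely explained by A.
rel <- c(A = 0.9, B = 0.85, C = 0.6)
red <- matrix(0, 3, 3, dimnames = list(names(rel), names(rel)))
red["B", "A"] <- 0.8
red["C", "A"] <- 0.1
ranking <- select_genes(rel, red, max_genes = 2)
ranking
```

In this toy case A is selected first, and C is preferred over B in the second step even though B has the higher relevancy, because most of B's relevancy is redundant with A.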

## Relevancy.

The relevance of a gene is calculated taking into account both its LOD score with respect to the disease under study and its Disease Association (DA) score.

Given a gene $g$, a disease $D$ and an optional set of pathologies $D_P$, the relevance of $g$ with respect to $D$ is calculated by the following equation.

\begin{equation} 
Rel(g,D,D_P) = DA(g,D) \times LOD(g,D_P)
\end{equation}

Where $DA(g,D)$ is the DA score of gene $g$ with respect to the disease $D$, and $LOD(g,D_P)$ is the LOD score of gene $g$ with respect to the sub-pathologies $D_P$ in multiclass problems and with respect to $D$ in binary problems.
 

 
### Disease Association score.
The Disease Association (DA) score between a gene and a given disease is based on the scores provided by the Open Targets Platform. To obtain it, a request is sent to the platform's API to retrieve the genes related to the disease. The result is then compared with the candidate genes available for each problem, and the scores of the genes in common are kept.
The score ranges from 0 to 1, where 1 indicates a strong relation between the gene and the disease and 0 indicates no relation.


### LOD score.

The LOD score, or B-statistic, is obtained with the limma package. It is a statistical indicator of the probability that a gene is differentially expressed between two health states. It is calculated as the logarithm of the ratio between the probability that the gene is differentially expressed for the disease and the probability that it is not differentially expressed [3,4]. In multiclass problems, a LOD score is calculated for each pair of health states, and the mean of these values is used as the single score of each gene.
Since this score is a logarithm, it can take any real value, so it is normalized to the same range as the DA score, that is, between 0 and 1.
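
The averaging and normalization steps, and the relevancy they feed into, can be sketched as follows. The text does not state which normalization is used, so min-max rescaling is an assumption here; the helper `normalize01`, the gene names and all LOD and DA values are hypothetical.

```r
# Hypothetical LOD (B-statistic) values: one row per gene, one column per contrast.
lod <- matrix(
  c( 4.2,  1.0,  2.3,
    -3.0, -1.5, -4.5,
     0.5,  6.1,  2.4),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC"),
                  c("adeno-scc", "adeno-STN", "scc-STN"))
)

# Multiclass case: average the LOD over all contrasts to get one value per gene.
mean_lod <- rowMeans(lod)

# Min-max rescaling to [0, 1] (assumed normalization).
normalize01 <- function(x) {
  rng <- range(x)
  if (rng[1] == rng[2]) return(x - rng[1])  # constant input: all zeros
  (x - rng[1]) / (rng[2] - rng[1])
}
lod_norm <- normalize01(mean_lod)

# Relevancy as the product of the DA score and the normalized LOD score.
da <- c(geneA = 1, geneB = 0.9, geneC = 0.5)  # hypothetical DA scores
rel <- da * lod_norm
rel
```

With min-max rescaling, the gene with the lowest mean LOD always receives a relevancy of 0 regardless of its DA score, which is one consequence of this particular normalization choice.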

## Redundancy.

Redundancy is calculated from the evidence linking a gene to a given disease. This evidence underlies the DA scores and is obtained through web queries with the KnowSeq package. The redundancy of gene $g$ over gene $g_i$ with respect to a disease $D$ is based on the proportion of the evidence for $g$ that can also be found, either partially or completely, in the evidence for $g_i$; in other words, the percentage of evidence that $g$ shares with $g_i$. Since the DA score is calculated from this evidence, the redundancy of $g$ over $g_i$ with respect to $D$ is defined as the proportion of the DA score of $g$ that is explained by $g_i$, as given by the following equation.

\begin{equation}
RED(g,g_i,D) = \frac{\textrm{Num. of evidences of $g$ in $g_i$}}{\textrm{Num. of evidences of $g$}} \times DA(g,D)
\end{equation}


Thus, $0 \leq RED(g,g_i,D) \leq DA(g,D) \leq 1$. A redundancy of 0 means that no evidence linking $g$ to $D$ also links $g_i$ to $D$, so the two DA scores are independent. Conversely, a redundancy of $DA(g,D)$ indicates that all the evidence found for $g$ in relation to $D$ also relates $g_i$ to $D$; that is, the DA score of $g$ can be fully explained by $g_i$.
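
As a worked illustration of the redundancy equation, the sketch below represents the evidence of each gene as a vector of identifiers and counts exact matches only; the partial matching mentioned above is not modelled. The function name `redundancy`, the identifiers and the DA values are all hypothetical.

```r
# evidences: named list with one character vector of evidence identifiers per gene
# da: named numeric vector of DA scores
redundancy <- function(g, g_i, evidences, da) {
  ev_g <- unique(evidences[[g]])
  if (length(ev_g) == 0) return(0)
  # Number of evidences of g that also appear among the evidences of g_i.
  shared <- sum(ev_g %in% evidences[[g_i]])
  unname(shared / length(ev_g) * da[g])
}

evidences <- list(
  geneA = c("pmid:1", "pmid:2", "pathway:X", "drug:Y"),
  geneB = c("pmid:2", "pathway:X", "pmid:9")
)
da <- c(geneA = 0.8, geneB = 1)

red_ab <- redundancy("geneA", "geneB", evidences, da)  # 2/4 of geneA's evidence, times 0.8
red_ba <- redundancy("geneB", "geneA", evidences, da)  # 2/3 of geneB's evidence, times 1
c(red_ab, red_ba)
```

The two values differ because the measure is not symmetric: it is normalized by the evidence count and the DA score of the first gene only.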


# Usage

To show how the DA-RED-LOD feature selector works, an example is run on multiclass data for two possible types of cancer. Kidney or lung cancer can be selected by setting the following variable.

In [1]:
cancer.type <- 'lung'

In [2]:
data.train <- read.table(paste('data/',cancer.type,'/expression-train-multiclass.csv',sep=''),sep='\t')
labels.train <- read.table(paste('data/',cancer.type,'/labels-train-multiclass.csv',sep=''),sep='\t')$x

In [3]:
source('featureSelection.R')


The function created to run the feature selection process receives an expression matrix with the samples in the rows and the genes in the columns, the sample labels and the disease the user wants to study. The *mode* parameter indicates which FS method is used; the options are da, daRed or daLOD. The user can also set the number of genes to extract with the *maxGenes* parameter. Finally, *returnEvidences* is a boolean parameter that indicates whether the evidence found for the selected genes is returned.

In this example, 10 genes are selected in relation to the chosen *cancer.type* disease with each of the three methods, and the evidence found is returned.

In [4]:
featureRankingDA <- featureSelection(data.train,labels.train,disease=cancer.type,mode='da',maxGenes=10,returnEvidences=TRUE)

Calculating ranking of biological relevant genes by using DA implementation...
Disease Association ranking: GABRA3 ROS1 ZBTB16 HOXC11 SFTPD SCN7A HMGA2 MMP3 ABCA3 SFTPC
Obtaining related diseases with the DEGs from targetValidation platform...
Evidences acquired successfully!


In [5]:
featureRankingDARED <- featureSelection(data.train,labels.train,disease=cancer.type,mode='daRed',maxGenes=10,returnEvidences=TRUE)

Calculating ranking of biological relevant genes by using DA-Red implementation...
Disease Association ranking: GABRA3 ROS1 ZBTB16 HOXC11 SFTPD SCN7A HMGA2 MMP3 ABCA3 SFTPC
Obtaining related diseases with the DEGs from targetValidation platform...
Evidences acquired successfully!
Calculating genes scores...


In [6]:
featureRankingDALOD <- featureSelection(data.train,labels.train,disease=cancer.type,mode='daLOD',maxGenes=10,returnEvidences=TRUE)

Disease Association ranking: GABRA3 ROS1 ZBTB16 HOXC11 SFTPD SCN7A HMGA2 MMP3 ABCA3 SFTPC
Obtaining related diseases with the DEGs from targetValidation platform...
Evidences acquired successfully!
Calculating genes LOD scores...
More than two classes detected, applying limma multiclass
Contrasts: adeno-scc
 Contrasts: adeno-STN
 Contrasts: scc-STN
Calculating genes scores...


## Ranking

The ranking of selected genes and the scores computed by each FS mode (DA, DA-RED and DA-RED-LOD) are shown below.

In [7]:
data.frame(featureRankingDA$ranking)

Gene,featureRankingDA.ranking <dbl>
GABRA3,1
ROS1,1
ZBTB16,1
HOXC11,1
SFTPD,1
SCN7A,1
HMGA2,1
MMP3,1
ABCA3,1
SFTPC,1


In [8]:
data.frame(unlist(featureRankingDARED$ranking))

Gene,unlist.featureRankingDARED.ranking. <dbl>
GABRA3.GABRA3,1.0
ABCA3,1.0
ROS1,0.9980284
SFTPD,0.9829535
MMP3,0.9564779
APOBEC3B,0.9535354
SFTPC,0.947092
HMGA2,0.9384921
MMP12,0.8637369
CHRNB4,0.8460661


In [9]:
data.frame(unlist(featureRankingDALOD$ranking))

Gene,unlist.featureRankingDALOD.ranking. <dbl>
TFAP2A,0.6088828
TICRR,0.4832688
CHRNB4,0.4782881
MMP12,0.4900457
E2F7,0.4110653
ABCA3,0.3775388
APOBEC3B,0.3278171
GABRA3,0.3523289
CDC45,0.3590342
SFTPC,0.2695595


## Interpretability

Let's check the evidence found for the first gene selected by each FS mode.

In [10]:
names(featureRankingDA$evidences)[1]
featureRankingDA$evidences[[1]]

Drug.Name,Molecule.Type
<list>,<list>
PROPOFOL,Small molecule
MIDAZOLAM,Small molecule
SEVOFLURANE,Small molecule

Url,Reactome.Url
<list>,<list>
http://europepmc.org/abstract/MED/28179366,http://www.reactome.org/PathwayBrowser/#R-HSA-5696395
http://europepmc.org/abstract/MED/28179366,http://www.reactome.org/PathwayBrowser/#R-HSA-354192

Url,Comparison
<list>,<list>
http://europepmc.org/abstract/MED/27699219,'tumor tissue' vs 'non-malignant tissue'
http://europepmc.org/abstract/MED/20522636,'lung cancer' vs 'normal' in 'lung; Fresh-frozen tissue'

Url
<list>
http://europepmc.org/abstract/MED/23617850
http://europepmc.org/abstract/MED/27081042
http://europepmc.org/abstract/MED/19048400
http://europepmc.org/abstract/MED/25089631


In [11]:
names(featureRankingDARED$evidences)[1]
featureRankingDARED$evidences[[1]]

Drug.Name,Molecule.Type
<list>,<list>
PROPOFOL,Small molecule
MIDAZOLAM,Small molecule
SEVOFLURANE,Small molecule

Url,Reactome.Url
<list>,<list>
http://europepmc.org/abstract/MED/28179366,http://www.reactome.org/PathwayBrowser/#R-HSA-5696395
http://europepmc.org/abstract/MED/28179366,http://www.reactome.org/PathwayBrowser/#R-HSA-354192

Url,Comparison
<list>,<list>
http://europepmc.org/abstract/MED/27699219,'tumor tissue' vs 'non-malignant tissue'
http://europepmc.org/abstract/MED/20522636,'lung cancer' vs 'normal' in 'lung; Fresh-frozen tissue'

Url
<list>
http://europepmc.org/abstract/MED/23617850
http://europepmc.org/abstract/MED/27081042
http://europepmc.org/abstract/MED/19048400
http://europepmc.org/abstract/MED/25089631


In [12]:
names(featureRankingDALOD$evidences)[1]
featureRankingDALOD$evidences[[1]]

Url,Reactome.Url
<list>,<list>
http://europepmc.org/abstract/MED/28179366,http://www.reactome.org/PathwayBrowser/#R-HSA-3134975
http://europepmc.org/abstract/MED/28179366,http://www.reactome.org/PathwayBrowser/#R-HSA-6802949
http://europepmc.org/abstract/MED/28179366,http://www.reactome.org/PathwayBrowser/#R-HSA-202403

Is.Associated,Specie
<named list>,<named list>
False,mouse

Url,Comparison
<list>,<list>
http://europepmc.org/abstract/MED/27699219,'tumor tissue' vs 'non-malignant tissue'
http://europepmc.org/abstract/MED/20522636,'lung cancer' vs 'normal' in 'lung; Fresh-frozen tissue'
http://europepmc.org/abstract/MED/20878980,'non-small cell lung cancer' vs 'normal'
http://europepmc.org/abstract/MED/20522636,'lung cancer' vs 'normal' in 'lung; Formalin-fixed paraffin-embedded tissue'
http://europepmc.org/abstract/MED/20802022,'primary tumor' vs 'adjacent normal tissue'
*,'squamous cell carcinoma' vs 'normal tissue adjacent to squamous cell carcinoma'
http://europepmc.org/abstract/MED/15653641,'lung adenocarcinoma' vs 'normal'

Url
<list>
http://europepmc.org/abstract/MED/16145912
http://europepmc.org/abstract/MED/25050743
http://europepmc.org/abstract/MED/25294805
http://europepmc.org/abstract/MED/25481043
http://europepmc.org/abstract/MED/29100274
http://europepmc.org/abstract/MED/15864740
http://europepmc.org/abstract/MED/17237224
http://europepmc.org/abstract/MED/31337972
http://europepmc.org/abstract/MED/32015686
http://europepmc.org/abstract/MED/28749936


## Classification

In [13]:
# Load the test expression matrix and the test labels
data.test <- t(read.table(paste('data/',cancer.type,'/expression-test-multiclass.csv',sep=''),sep='\t'))
labels.test <- read.table(paste('data/',cancer.type,'/labels-test-multiclass.csv',sep=''),sep='\t')$x


In [None]:
test.results.da <- knn_test(data.train,labels.train,data.test,labels.test,names(featureRankingDA$ranking))
test.results.daRed <- knn_test(data.train,labels.train,data.test,labels.test,names(featureRankingDARED$ranking))
test.results.daLOD <- knn_test(data.train,labels.train,data.test,labels.test,names(featureRankingDALOD$ranking))

In [None]:
# Plot test accuracy against the number of selected genes for each FS mode
library(ggplot2)
data <- data.frame('da'=test.results.da$accVector,'daRed'=test.results.daRed$accVector,
                   'daLod'=test.results.daLOD$accVector,'ngenes'=seq(1,length(test.results.da$accVector)))
ggplot(data = data) +
geom_line(aes(x=ngenes,y=da,colour = 'da')) +
geom_line(aes(x=ngenes,y=daRed,colour = 'daRed')) +
geom_line(aes(x=ngenes,y=daLod,colour = 'daLod')) +
labs(y = 'Accuracy',x='Number of genes') + ggtitle('Accuracy')

# References

1. Carvalho-Silva, D., Pierleoni, A., Pignatelli, M., Ong, C., Fumis, L., Karamanis, N., ... & Miranda, A. (2019). Open Targets Platform: new developments and updates two years on. Nucleic Acids Research, 47(D1), D1056-D1065.
2. Koscielny, G., An, P., Carvalho-Silva, D., Cham, J. A., Fumis, L., Gasparyan, R., ... & Pierleoni, A. (2017). Open Targets: a platform for therapeutic target identification and validation. Nucleic Acids Research, 45(D1), D985-D994.
3. Ritchie, M. E., Phipson, B., Wu, D. I., Hu, Y., Law, C. W., Shi, W., & Smyth, G. K. (2015). limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research, 43(7), e47-e47.
4. Nyholt, D. R. (2000). All LODs are not created equal. The American Journal of Human Genetics, 67(2), 282-288.